📚 https://arxiv.org/abs/2403.15377
🏆 Published in ECCV 2024
📄 InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
✨ Key Contributions
- Provides comprehensive experimental validation across multiple multimodal benchmarks.
- Demonstrates consistent scaling improvements (Base → Large → Giant) in retrieval, recognition, captioning, and QA tasks.
- Confirms the effectiveness of audio integration and progressive pretraining.
- Establishes clear scaling laws, where larger datasets and model sizes lead to proportional performance gains.
🎯 Problem Definition
- Prior video-language models plateaued despite scaling due to modality imbalance and inefficient optimization.
- The work asks whether jointly scaling data and parameters yields consistent, near-linear improvements in multimodal performance.
- InternVideo2 examines how unified multimodal pretraining generalizes across diverse downstream tasks.
🧠 Method / Architecture
- Retains the progressive training framework (MVM → CL → GEN): masked video modeling, then cross-modal contrastive learning, then generative next-token prediction.
- Employs LoRA-based fine-tuning for efficient downstream adaptation (a minimal LoRA sketch follows this list).
- Evaluated across four key categories:
  - Video–Text Retrieval – MSR-VTT, VATEX
  - Action Recognition – Kinetics-400, SSv2
  - Video Captioning – MSVD, MSR-VTT
  - Audio–Visual QA – NExT-QA, AVQA
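As a rough illustration (not the paper's actual code), here is a minimal PyTorch sketch of the LoRA idea: a frozen pretrained linear layer plus a trainable low-rank residual. The layer size, rank, and scaling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep pretrained weights frozen
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank residual path; only lora_A / lora_B receive gradients.
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Example: adapt a hypothetical 1024-d projection inside the video encoder.
proj = nn.Linear(1024, 1024)
adapted = LoRALinear(proj, rank=8)
out = adapted(torch.randn(2, 16, 1024))   # (batch, tokens, dim)
```

Only the small lora_A / lora_B matrices are updated during downstream fine-tuning, which is what makes the adaptation parameter-efficient.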
🧪 Experiments & Results
🎬 Video–Text Retrieval
Scaling from Base (1B) → Giant (6B) improves text→video R@1 from 52.8 to 61.2 on MSR-VTT.
The gain reflects stronger cross-modal alignment and semantic understanding (a Recall@K sketch follows the table).
| Dataset | Model | R@1 ↑ | R@5 ↑ | R@10 ↑ |
| --- | --- | --- | --- | --- |
| MSR-VTT (Text→Video) | InternVideo2-Base | 52.8 | 79.3 | 87.1 |
| MSR-VTT (Text→Video) | InternVideo2-Giant | 61.2 | 85.6 | 92.3 |
| VATEX (Video→Text) | InternVideo2-Giant | 77.8 | 94.3 | 97.6 |
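For reference, text→video R@K is computed by ranking every video against each text query and checking whether the paired video lands in the top K. A minimal numpy sketch, assuming L2-normalized embeddings where index i of the texts pairs with index i of the videos:

```python
import numpy as np

def recall_at_k(text_emb: np.ndarray, video_emb: np.ndarray, k: int) -> float:
    """Fraction of text queries whose paired video ranks in the top k by cosine similarity."""
    sim = text_emb @ video_emb.T                 # (N, N) similarity matrix
    ranks = np.argsort(-sim, axis=1)             # candidate videos sorted best-first per query
    gt = np.arange(sim.shape[0])[:, None]        # ground-truth video index for each query
    return float((ranks[:, :k] == gt).any(axis=1).mean())

# Toy usage with random unit vectors; real features come from the video/text encoders.
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 512)); t /= np.linalg.norm(t, axis=1, keepdims=True)
v = rng.normal(size=(100, 512)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(recall_at_k(t, v, k=1), recall_at_k(t, v, k=5))
```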
🏃 Action Recognition
HTE captures long-range temporal dynamics, boosting performance on motion-heavy benchmarks such as SSv2 (a top-k accuracy sketch follows the table).
| Dataset | Model | Top-1 ↑ | Top-5 ↑ |
| --- | --- | --- | --- |
| Kinetics-400 | InternVideo2-Giant | 91.8 | 98.4 |
| SSv2 | InternVideo2-Giant | 74.6 | 94.1 |
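Top-1/Top-5 are standard classification accuracies over per-clip logits; a small numpy sketch (shapes are illustrative):

```python
import numpy as np

def topk_accuracy(logits: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Share of clips whose true class is among the k highest-scoring classes."""
    topk = np.argsort(-logits, axis=1)[:, :k]
    return float((topk == labels[:, None]).any(axis=1).mean())

# Toy usage: 8 clips scored over 400 classes (as in Kinetics-400).
rng = np.random.default_rng(1)
logits = rng.normal(size=(8, 400))
labels = rng.integers(0, 400, size=8)
print(topk_accuracy(logits, labels, 1), topk_accuracy(logits, labels, 5))
```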
🗣️ Video Captioning
Generative pretraining improves caption fluency and semantic richness (a simplified BLEU@4 sketch follows the table).
| Dataset | Model | BLEU@4 ↑ | CIDEr ↑ |
| --- | --- | --- | --- |
| MSVD | InternVideo2-Base | 64.2 | 122.4 |
| MSVD | InternVideo2-Giant | 68.1 | 137.9 |
| MSR-VTT | InternVideo2-Giant | 53.7 | 110.5 |
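BLEU@4 is the geometric mean of clipped 1-4-gram precisions times a brevity penalty. A simplified single-reference sketch (real evaluation uses corpus-level scoring with multiple references, and CIDEr adds TF-IDF weighting on top):

```python
import math
from collections import Counter

def bleu4(candidate: str, reference: str) -> float:
    """Geometric mean of clipped 1-4-gram precisions, times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, 5):
        c_grams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_grams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        clipped = sum(min(c, r_grams[g]) for g, c in c_grams.items())
        total = max(sum(c_grams.values()), 1)
        log_prec += 0.25 * math.log(max(clipped, 1e-9) / total)
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)

print(bleu4("a man is playing a guitar on stage", "a man plays a guitar on stage"))
```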
🔉 Audio–Visual QA
Audio integration enhances temporal reasoning and context comprehension (a toy fusion sketch follows the table).
| Dataset | Model | Accuracy ↑ |
| --- | --- | --- |
| NExT-QA | InternVideo2-Giant | 63.2 |
| AVQA | InternVideo2-Giant | 70.4 |
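The paper's cross-modal fusion is learned during pretraining; purely as a toy illustration of why an audio stream can help, here is a hypothetical late-fusion QA head that concatenates pooled video, audio, and question embeddings before scoring answer candidates (all names and dimensions are assumptions, not the paper's CMFM):

```python
import torch
import torch.nn as nn

class LateFusionQAHead(nn.Module):
    """Concatenate pooled video, audio, and question features, then score answer candidates."""
    def __init__(self, vid_dim=768, aud_dim=512, txt_dim=768, hidden=512, num_answers=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vid_dim + aud_dim + txt_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, vid, aud, txt):
        # Each input is a pooled embedding of shape (batch, dim).
        return self.mlp(torch.cat([vid, aud, txt], dim=-1))   # (batch, num_answers)

head = LateFusionQAHead()
logits = head(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 768))
pred = logits.argmax(dim=-1)   # predicted answer index per question
```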
⚙️ Ablation Study
| Variant | Removed Component | ΔPerformance | Observation |
| --- | --- | --- | --- |
| w/o MVM | Masked Video Modeling | ↓ 4.3 R@1 | Weak temporal representation |
| w/o Audio | Audio Modality | ↓ 3.7 Acc | Reduced multimodal reasoning |
| w/o Progressive | Staged training | ↓ 5.9 | Unstable optimization |
| w/ Frozen Fusion | Fixed CMFM | ↓ 2.8 | Impaired alignment |
The modules are complementary: removing any single component degrades results, and the progressive curriculum keeps optimization stable as the model scales.
📈 Scaling Effect
| Model | Params | MSR-VTT R@1 ↑ | K400 Top-1 ↑ | CIDEr ↑ |
| --- | --- | --- | --- | --- |
| Base | 1 B | 52.8 | 89.5 | 122.4 |
| Large | 3 B | 57.6 | 90.7 | 130.5 |
| Giant | 6 B | 61.2 | 91.8 | 137.9 |
Observation: performance grows near-linearly with parameter count across retrieval, recognition, and captioning, supporting the scaling-law claim (a rough log-linear fit over the table's numbers follows).
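To make "near-linear" concrete, a quick log-linear fit of MSR-VTT R@1 against parameter count, using only the three points from the table above (a sanity-check sketch, not an analysis from the paper):

```python
import numpy as np

params = np.array([1e9, 3e9, 6e9])       # Base, Large, Giant parameter counts
r_at_1 = np.array([52.8, 57.6, 61.2])    # MSR-VTT R@1 from the table above

slope, intercept = np.polyfit(np.log10(params), r_at_1, deg=1)
print(f"~{slope:.1f} R@1 points per 10x parameters")
for p, r in zip(params, r_at_1):
    pred = slope * np.log10(p) + intercept
    print(f"{p:.0e} params: actual {r:.1f}, fit {pred:.1f}")
```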
🔬 Baseline Comparison
| Model | Pretraining | MSR-VTT R@1 ↑ | K400 Top-1 ↑ | CIDEr ↑ |
| --- | --- | --- | --- | --- |
| CLIP (ViT-L) | Image–Text | 43.1 | – | – |
| X-CLIP | Video–Text | 46.8 | 88.2 | 99.4 |
| OmniVL | Multi-modal | 49.7 | 89.1 | 108.3 |
| InternVideo2-Giant | Multi-modal | 61.2 | 91.8 | 137.9 |
InternVideo2 achieves new state-of-the-art results across all reported benchmarks.
🚫 Limitations
- Requires large-scale computation (e.g., 1024 GPUs for two months).
- Limited access to diverse, uncurated video–audio pairs.
- Heavy reliance on curated datasets may limit open-domain adaptability.
🔭 Future Ideas
- Integrate unlabeled multimodal web data for robust continual learning.
- Develop parameter-efficient fine-tuning (QLoRA, adapters, etc.).
- Implement adaptive modality balancing to handle uneven data distributions.
🔁 Personal Reflections
- Clear evidence that scaling laws apply robustly to multimodal learning.
- Audio, video, and text fusion enables holistic understanding of real-world contexts.
- The progressive pretraining framework ensures both stability and depth.
- InternVideo2 establishes itself as a true video foundation model, paving the way for scalable multimodal intelligence.