📚 https://arxiv.org/abs/2403.15377
🏆 Published in ECCV 2024

📄 InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

✨ Key Contributions


🎯 Problem Definition


🧠 Method / Architecture


🧪 Experiments & Results

🎬 Video–Text Retrieval

Scaling from Base (1B) to Giant (6B) raises text-to-video R@1 on MSR-VTT from 52.8 to 61.2 (+8.4 points).
The gain suggests stronger cross-modal alignment and semantic understanding at larger scale.

| Dataset | Model | R@1 ↑ | R@5 ↑ | R@10 ↑ |
|---|---|---|---|---|
| MSR-VTT (Text→Video) | InternVideo2-Base | 52.8 | 79.3 | 87.1 |
| MSR-VTT (Text→Video) | InternVideo2-Giant | 61.2 | 85.6 | 92.3 |
| VATEX (Video→Text) | InternVideo2-Giant | 77.8 | 94.3 | 97.6 |
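
For reference, R@K in the table above is the percentage of queries whose correct match appears among the top K retrieved items. Below is a minimal sketch of that computation, assuming a square similarity matrix in which the correct video for query i sits at index i. The function name `recall_at_k` and the toy scores are illustrative and do not come from the paper or its code.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth item ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the matching video for query i is at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank = number of videos scoring strictly higher
        # than the correct one.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (hypothetical scores, not from the paper).
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct video ranked 2nd
    [0.7, 0.6, 0.4],  # query 2: correct video ranked 3rd
]
r1 = recall_at_k(toy_similarity, 1)
r2 = recall_at_k(toy_similarity, 2)
r3 = recall_at_k(toy_similarity, 3)
print(f"R@1={r1:.1f}  R@2={r2:.1f}  R@3={r3:.1f}")
```

On the toy matrix, one of three queries is correct at rank 1, two at rank 2 or better, and all three within the top 3. Video→Text retrieval uses the same computation on the transposed matrix.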

🏃 Action Recognition

HTE captures long-range temporal dynamics, boosting performance on motion-heavy tasks.

| Dataset | Model | Top-1 ↑ | Top-5 ↑ |
|---|---|---|---|
| Kinetics-400 | InternVideo2-Giant | 91.8 | 98.4 |
| SSv2 | InternVideo2-Giant | 74.6 | 94.1 |
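
Top-1 and Top-5 count a clip as correct when its true label is among the 1 or 5 highest-scoring classes. A small sketch of the metric follows, using made-up class scores for four clips and three classes. The helper `top_k_accuracy` is a hypothetical name, not part of any InternVideo2 release.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k best-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return 100.0 * hits / len(labels)


# Hypothetical per-class scores for four clips (not from the paper).
toy_scores = [
    [0.1, 0.7, 0.2],  # best class: 1
    [0.5, 0.3, 0.2],  # best class: 0, runner-up: 1
    [0.2, 0.3, 0.5],  # best class: 2
    [0.6, 0.3, 0.1],  # best class: 0, runner-up: 1
]
toy_labels = [1, 1, 2, 2]

top1 = top_k_accuracy(toy_scores, toy_labels, 1)
top2 = top_k_accuracy(toy_scores, toy_labels, 2)
print(f"Top-1={top1:.1f}  Top-2={top2:.1f}")
```

The second clip is wrong at Top-1 but recovered at Top-2, which is why Top-5 numbers always sit at or above Top-1.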

🗣️ Video Captioning

Generative training improves fluency and semantic richness.

| Dataset | Model | BLEU@4 ↑ | CIDEr ↑ |
|---|---|---|---|
| MSVD | InternVideo2-Base | 64.2 | 122.4 |
| MSVD | InternVideo2-Giant | 68.1 | 137.9 |
| MSR-VTT | InternVideo2-Giant | 53.7 | 110.5 |

🔉 Audio–Visual QA

Audio integration enhances temporal reasoning and context comprehension.

| Dataset | Model | Accuracy ↑ |
|---|---|---|
| NExT-QA | InternVideo2-Giant | 63.2 |
| AVQA | InternVideo2-Giant | 70.4 |

⚙️ Ablation Study

| Variant | Removed Component | ΔPerformance | Observation |
|---|---|---|---|
| w/o MVM | Masked Video Modeling | ↓ 4.3 R@1 | Weak temporal representation |
| w/o Audio | Audio Modality | ↓ 3.7 Acc | Reduced multimodal reasoning |
| w/o Progressive | No staged training | ↓ 5.9 | Unstable optimization |
| w/ Frozen Fusion | Fixed CMFM | ↓ 2.8 | Impaired alignment |

Every ablation hurts performance. The largest listed drop (5.9) comes from removing staged training, consistent with the progressive curriculum stabilizing optimization.


📈 Scaling Effect

| Model | Params | MSR-VTT R@1 ↑ | K400 Top-1 ↑ | MSVD CIDEr ↑ |
|---|---|---|---|---|
| Base | 1 B | 52.8 | 89.5 | 122.4 |
| Large | 3 B | 57.6 | 90.7 | 130.5 |
| Giant | 6 B | 61.2 | 91.8 | 137.9 |

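Observation: Performance rises monotonically with model scale, but the growth is closer to log-linear than linear in parameter count: MSR-VTT R@1 gains 4.8 points from 1B to 3B and 3.6 points from 3B to 6B. This is consistent with scaling-law behavior, though the table varies only model size, so it does not isolate the effect of data volume.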
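
A quick way to check the shape of the scaling trend is to normalize each R@1 gain by the parameters added and by the number of parameter doublings. The sketch below uses only the three MSR-VTT R@1 values from the table. The `gains` helper is a hypothetical name, and three points are too few to fit a scaling law, so treat it as a sanity check only.

```python
import math

# (model, parameters in billions, MSR-VTT R@1), copied from the scaling table.
scaling = [
    ("Base", 1.0, 52.8),
    ("Large", 3.0, 57.6),
    ("Giant", 6.0, 61.2),
]


def gains(rows):
    """R@1 gain between consecutive model sizes, raw and normalized."""
    out = []
    for (_, p0, s0), (name, p1, s1) in zip(rows, rows[1:]):
        delta = s1 - s0
        out.append({
            "step": name,  # the larger model of each pair
            "delta": round(delta, 2),
            # Gain per extra billion parameters (linear view).
            "per_billion": round(delta / (p1 - p0), 2),
            # Gain per doubling of parameter count (log view).
            "per_doubling": round(delta / math.log2(p1 / p0), 2),
        })
    return out


steps = gains(scaling)
for step in steps:
    print(step)
```

The gain per extra billion parameters halves between the two steps (2.4, then 1.2), while the gain per doubling stays in the same range (about 3.0, then 3.6). That pattern fits a roughly log-linear trend better than a linear one.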


🔬 Baseline Comparison

| Model | Backbone | MSR-VTT R@1 ↑ | K400 Top-1 ↑ | MSVD CIDEr ↑ |
|---|---|---|---|---|
| CLIP (ViT-L) | Image–Text | 43.1 | | |
| X-CLIP | Video–Text | 46.8 | 88.2 | 99.4 |
| OmniVL | Multi-modal | 49.7 | 89.1 | 108.3 |
| InternVideo2-Giant | Multi-modal | 61.2 | 91.8 | 137.9 |

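InternVideo2-Giant outperforms every listed baseline on all three metrics. Over OmniVL, the strongest baseline, the margins are +11.5 R@1 on MSR-VTT, +2.7 Top-1 on K400, and +29.6 CIDEr.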


🚫 Limitations


🔭 Future Ideas


🔁 Personal Reflections