📚 https://arxiv.org/abs/2403.15377
🏆 Published in ECCV 2024
📄 InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
✨ Key Contributions
- Introduces a unified multimodal video foundation model that integrates visual, audio, speech, and text modalities.
- Proposes a progressive training framework combining masked video modeling, contrastive learning, and next-token prediction.
- Achieves state-of-the-art performance on over 60 benchmarks, demonstrating strong generalization and scalability.
- Scales the video encoder up to 6B parameters, trained on 400M video-text pairs, showing consistent performance gains with model size.
🎯 Problem Definition
- Prior vision-language models (e.g., CLIP, BLIP-2) handle static images well but struggle with temporal coherence in videos.
- Existing methods rely on fragmented objectives (contrastive, reconstruction, or captioning) without a unified training structure.
- The challenge: Build a scalable and cohesive video foundation model capable of understanding motion, sound, and language together.
🧠 Method / Architecture
- Progressive Multimodal Training
- Stage 1: Masked video modeling (MVM) for spatiotemporal representation learning.
- Stage 2: Multimodal contrastive learning aligning video–text–audio–speech features (see the alignment sketch at the end of this section).
- Stage 3: Next-token prediction for generative reasoning and contextual understanding.
- Model Design
- Hierarchical temporal encoder for long-term motion modeling.
- Unified transformer backbone handling multimodal fusion.
- Large-scale distributed training using DeepSpeed and FlashAttention.
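To make the Stage 2 alignment concrete, below is a minimal sketch of symmetric InfoNCE-style contrastive alignment between modality embeddings, written in PyTorch. The function name, embedding dimensions, and the random tensors standing in for encoder outputs are illustrative assumptions, not InternVideo2's actual implementation.

```python
# Minimal sketch of cross-modal contrastive alignment (Stage 2 idea):
# matched video/text (or video/audio) pairs are pulled together, mismatched
# pairs pushed apart, via a symmetric InfoNCE loss.
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor,
                     emb_b: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings from two modalities."""
    # Normalize so the dot product equals cosine similarity.
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    # Pairwise similarity matrix; matching pairs sit on the diagonal.
    logits = a @ b.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average both retrieval directions (a->b and b->a).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Usage with random features standing in for pooled encoder outputs
# (hypothetical 512-dim shared embedding space):
video_emb = torch.randn(8, 512)   # video encoder output
text_emb = torch.randn(8, 512)    # text encoder output
audio_emb = torch.randn(8, 512)   # audio/speech handled analogously
loss = contrastive_loss(video_emb, text_emb) + contrastive_loss(video_emb, audio_emb)
```

The same pairwise loss can be summed over any subset of modalities present in a batch, which is one simple way to read the "video–text–audio–speech" alignment described above.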
🧪 Experiments & Results
- Trained on 400M multimodal pairs from diverse video datasets.
- Evaluated on retrieval, captioning, recognition, and dialogue benchmarks.
- Outperformed previous models on 60+ of these tasks, indicating that multimodal supervision (especially audio and speech) strengthens semantic comprehension.
🚫 Limitations
- Extremely resource-intensive: pretraining demands massive compute and data.
- The framework is optimized for large-scale pretraining; adaptation to small datasets remains limited.
- Multimodal alignment still relies on extensive labeled data.
🔭 Future Ideas
- Develop efficient pretraining techniques for smaller datasets.
- Explore adapter-based finetuning to extend flexibility to domain-specific tasks (a minimal adapter sketch follows this list).
- Integrate reinforcement or human feedback for richer temporal and contextual understanding.
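As a rough illustration of the adapter idea above, here is a generic Houlsby-style bottleneck adapter in PyTorch. The class name, hidden width, and bottleneck size are assumptions for the sketch; this is not a module from InternVideo2.

```python
# Hypothetical bottleneck adapter that could be inserted into a frozen
# transformer block so that only a small number of parameters are finetuned.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project to a small bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # project back to model width
        # Zero-init the up-projection so the adapter starts as an identity map.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual update: frozen backbone features plus a learned correction.
        return x + self.up(self.act(self.down(x)))

# Example: adapt token features from a frozen block (batch, tokens, hidden).
x = torch.randn(2, 16, 768)
adapted = BottleneckAdapter(dim=768)(x)
```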
🔁 Personal Reflections
- InternVideo2 shows that video foundation models are following the same scaling laws as LLMs and CLIP.
- Its unified framework elegantly bridges reconstruction, alignment, and generation in a single pipeline.
- The study reinforces a clear trend: scaling + multimodality are the future of video understanding.