📚 https://arxiv.org/abs/2403.15377
🏆 Published in ECCV 2024
📄 InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
✨ Key Contributions
- Introduces a unified multimodal video foundation model that integrates visual, audio, speech, and text modalities.
- Proposes a progressive training framework combining masked video modeling, contrastive learning, and next-token prediction.
- Achieves state-of-the-art performance on over 60 benchmarks, demonstrating strong generalization and scalability.
- Scales the video encoder up to 6B parameters, trained on 400M video-text pairs, showing consistent performance gains with model size.
🎯 Problem Definition
- Prior vision-language models (e.g., CLIP, BLIP-2) handle static images well but struggle with temporal coherence in videos.
- Existing methods rely on fragmented objectives (contrastive, reconstruction, or captioning) without a unified training structure.
- The challenge: Build a scalable and cohesive video foundation model capable of understanding motion, sound, and language together.
🧠 Method / Architecture
- Progressive Multimodal Training
- Stage 1: Masked video modeling (MVM) for spatiotemporal representation learning.
- Stage 2: Multimodal contrastive learning aligning video–text–audio–speech features (see the alignment sketch at the end of this section).
- Stage 3: Next-token prediction for generative reasoning and contextual understanding.
- Model Design
- Hierarchical temporal encoder for long-term motion modeling.
- Unified transformer backbone handling multimodal fusion.
- Large-scale distributed training using DeepSpeed and FlashAttention.
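To make the Stage 2 alignment concrete, below is a minimal sketch of symmetric InfoNCE-style contrastive alignment between modality embeddings, written in PyTorch. The function name, embedding dimensions, and the random tensors standing in for encoder outputs are illustrative assumptions, not InternVideo2's actual implementation.

```python
# Minimal sketch of cross-modal contrastive alignment (Stage 2 idea):
# matched video/text (or video/audio) pairs are pulled together, mismatched
# pairs pushed apart, via a symmetric InfoNCE loss.
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor,
                     emb_b: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings from two modalities."""
    # Normalize so the dot product equals cosine similarity.
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    # Pairwise similarity matrix; matching pairs sit on the diagonal.
    logits = a @ b.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average both retrieval directions (a->b and b->a).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Usage with random features standing in for pooled encoder outputs
# (hypothetical 512-dim shared embedding space):
video_emb = torch.randn(8, 512)   # video encoder output
text_emb = torch.randn(8, 512)    # text encoder output
audio_emb = torch.randn(8, 512)   # audio/speech handled analogously
loss = contrastive_loss(video_emb, text_emb) + contrastive_loss(video_emb, audio_emb)
```

The same pairwise loss can be summed over any subset of modalities present in a batch, which is one simple way to read the "video–text–audio–speech" alignment described above.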
🧪 Experiments & Results
- Trained on 400M multimodal pairs from diverse video datasets.
- Evaluated on retrieval, captioning, recognition, and dialogue benchmarks.
- Outperformed previous models on 60+ of these tasks, indicating that multimodal supervision (especially audio and speech) strengthens semantic comprehension.
🚫 Limitations
- Extremely resource-intensive: pretraining demands massive compute and data.
- The framework is optimized for large-scale pretraining; adaptation to small datasets remains limited.
- Multimodal alignment still relies on extensive labeled data.
🔭 Future Ideas
- Develop efficient pretraining techniques for smaller datasets.
- Explore adapter-based finetuning to extend flexibility to domain-specific tasks (a minimal adapter sketch follows this list).
- Integrate reinforcement or human feedback for richer temporal and contextual understanding.
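As a rough illustration of the adapter idea above, here is a generic Houlsby-style bottleneck adapter in PyTorch. The class name, hidden width, and bottleneck size are assumptions for the sketch; this is not a module from InternVideo2.

```python
# Hypothetical bottleneck adapter that could be inserted into a frozen
# transformer block so that only a small number of parameters are finetuned.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project to a small bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # project back to model width
        # Zero-init the up-projection so the adapter starts as an identity map.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual update: frozen backbone features plus a learned correction.
        return x + self.up(self.act(self.down(x)))

# Example: adapt token features from a frozen block (batch, tokens, hidden).
x = torch.randn(2, 16, 768)
adapted = BottleneckAdapter(dim=768)(x)
```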
🔁 Personal Reflections
- InternVideo2 shows that video foundation models are following the same scaling laws as LLMs and CLIP.
- Its unified framework elegantly bridges reconstruction, alignment, and generation in a single pipeline.
- The study reinforces a clear trend: scaling + multimodality are the future of video understanding.