📄 https://arxiv.org/abs/2010.11929
📌 Published in ICLR 2021
✨ Key Contributions
- Proposed the first pure Transformer architecture for large-scale image recognition.
- Demonstrated that Transformers, when trained on very large datasets, can match or outperform CNNs.
- Showed that the favorable scaling behavior of Transformers in NLP carries over to vision: larger models and larger datasets keep improving accuracy.
🎯 Problem Definition
- CNNs have dominated vision tasks but rely heavily on inductive biases (locality, translation equivariance).
- Scaling CNNs further provides diminishing returns.
- Question: Can a convolution-free architecture (Transformer) handle vision tasks effectively if trained on enough data?
🧠 Method / Architecture
- Represent an image as a sequence of fixed-size patches (e.g., 16×16 pixels; a 224×224 image yields 14×14 = 196 patches).
- Flatten each patch and linearly project it to a patch embedding.
- Feed the sequence into a Transformer encoder, similar to NLP.
- Prepend a learnable [CLS] token whose final representation is used for classification.
- Add learnable positional embeddings to retain spatial information (see the sketch after this list).
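A minimal sketch of this pipeline in PyTorch, to make the patch-embed → [CLS] → positional-embedding → encoder flow concrete. The class name `MiniViT` and the hyperparameters are illustrative (roughly ViT-Base-like), not the paper's exact configuration, and the built-in `nn.TransformerEncoder` stands in for the paper's own JAX implementation.

```python
# Minimal ViT-style forward pass (sketch, not the reference implementation).
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A strided conv is equivalent to "flatten each patch + linear projection".
        self.patch_embed = nn.Conv2d(in_chans, dim,
                                     kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and positional embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Standard Transformer encoder, as in NLP.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.patch_embed(x)                  # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)         # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                # classify from the [CLS] token

logits = MiniViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```

Using a strided convolution for the patch projection is just a convenient implementation trick; mathematically it is the same per-patch flatten-and-project step described above.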
🧪 Experiments & Results
- Initial experiments show that ViT needs large-scale pretraining (e.g., JFT-300M) to perform competitively.
- With sufficient pretraining data, ViT matches or exceeds state-of-the-art CNNs on benchmarks such as ImageNet, CIFAR-100, and VTAB, while needing substantially less pretraining compute.
🚫 Limitations
- Requires huge datasets and significant compute resources for training.
- Underperforms comparable CNNs when trained from scratch on mid-sized datasets (e.g., ImageNet alone), since it lacks the convolutional inductive biases.
💭 Future Ideas
- Explore data-efficient training methods for ViT.
- Combine CNN inductive biases with Transformers (e.g., hybrid models).
- Extend to broader CV tasks: detection, segmentation, video understanding.
📝 Personal Reflections
- Fascinating to see how a convolution-free approach can rival CNNs.
- Reinforces the importance of scaling in deep learning.
- ViT feels like a turning point where vision starts catching up with NLP in architectural unification.