📌 https://arxiv.org/abs/1706.03762
📌 Published in NeurIPS 2017
Day 5 – Training & Results
📌 Optimizer & Learning Rate Schedule
- The authors used the Adam optimizer with slightly non-standard settings (β1 = 0.9, β2 = 0.98, ε = 10⁻⁹), tuned to keep training stable.
- Instead of a fixed learning rate, they used a warmup schedule (see the short sketch after this list):
  - For the first 4,000 steps, the learning rate increases linearly, so the model doesn't "shock" itself with big updates early on.
  - After warmup, it decays proportionally to the inverse square root of the step number, letting the model fine-tune without overshooting.
- This schedule made training more reliable and efficient.
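For reference, the paper defines the rate as lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)) with warmup_steps = 4000. Here is a minimal Python sketch of that formula; the function name is just for illustration (the authors' actual implementation lived in tensor2tensor):

```python
# Learning-rate schedule from the paper:
#   lrate = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
# Linear warmup for the first warmup_steps, then inverse-square-root decay.

def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lr(100))    # early in warmup -> small rate
print(transformer_lr(4000))   # peak of the schedule (~7e-4 for d_model = 512)
print(transformer_lr(40000))  # decayed by ~1/sqrt(10) from the peak
```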
📌 Regularization
- Dropout (P_drop = 0.1): randomly drops activations during training (applied to each sub-layer's output and to the sums of embeddings and positional encodings), which helps prevent overfitting.
- Label smoothing (ε = 0.1): instead of teaching the model that the "correct" token has probability 1 and every other token 0, the targets are softened slightly (illustrated after this list).
- This makes the model less overconfident; the paper notes it slightly hurts perplexity but improves accuracy and BLEU.
- As a result, it generalizes better to unseen data.
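To make label smoothing concrete, here is a small PyTorch-style sketch (my own illustration, not the paper's code, which used TensorFlow): the one-hot target becomes a distribution that gives 1 − ε to the correct token and spreads ε over the rest, and the loss is computed against that softened target.

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels: torch.Tensor, vocab_size: int, eps: float = 0.1) -> torch.Tensor:
    """Replace hard one-hot targets with smoothed ones: the correct class gets
    1 - eps, and the remaining mass eps is spread over the other classes."""
    smooth = torch.full((labels.size(0), vocab_size), eps / (vocab_size - 1))
    smooth.scatter_(1, labels.unsqueeze(1), 1.0 - eps)
    return smooth

# Toy example with a 5-token vocabulary and eps = 0.1:
labels = torch.tensor([2, 0])                     # indices of the "correct" tokens
targets = smoothed_targets(labels, vocab_size=5)  # row 0 -> [0.025, 0.025, 0.9, 0.025, 0.025]
logits = torch.randn(2, 5)                        # stand-in for model outputs
loss = F.kl_div(F.log_softmax(logits, dim=-1), targets, reduction="batchmean")
```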
📌 Training Details
- Batch size: each training batch contained roughly 25,000 source tokens and 25,000 target tokens, batched by approximate sentence length.
- Hardware: a single machine with 8 NVIDIA P100 GPUs.
- Time: surprisingly fast – the base model trained in about 12 hours (100,000 steps); the big model took about 3.5 days (300,000 steps).
- Model sizes:
- Base model: ~65 million parameters
- Big model: ~213 million parameters, with wider layers (larger model and feed-forward dimensions, more attention heads) at the same depth; a rough parameter-count check follows this list.
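As a sanity check on the ~65M figure, here is a back-of-the-envelope count using the base model's published hyperparameters (d_model = 512, d_ff = 2048, 6 encoder and 6 decoder layers, ~37k shared BPE vocabulary), ignoring biases, layer norms, and the fixed sinusoidal positional encodings:

```python
# Rough parameter count for the base Transformer (biases, layer norms, and the
# sinusoidal positional encodings are ignored; they add relatively little).
d_model, d_ff, n_layers, vocab = 512, 2048, 6, 37000

embedding  = vocab * d_model                          # shared input/output embedding  ~18.9M
attn_block = 4 * d_model * d_model                    # W_Q, W_K, W_V, W_O             ~1.0M
ffn_block  = 2 * d_model * d_ff                       # two linear layers              ~2.1M
encoder    = n_layers * (attn_block + ffn_block)      # self-attention + FFN           ~18.9M
decoder    = n_layers * (2 * attn_block + ffn_block)  # adds cross-attention           ~25.2M

total = embedding + encoder + decoder
print(f"{total / 1e6:.1f}M parameters")  # ~63M, in line with the paper's ~65M
```

Plugging in the big model's values (d_model = 1024, d_ff = 4096) gives roughly 214M with the same formula, consistent with the ~213M figure.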
📌 Results – Translation Benchmarks
- English → German (WMT 2014): 28.4 BLEU, more than 2 BLEU above the best previously reported results, including ensembles.
- English → French (WMT 2014): 41.8 BLEU, a new single-model state of the art, at a fraction of the training cost of the previous best models.
- These results showed the Transformer could outperform RNNs and CNNs not just in speed, but also in quality.
📌 Additional Experiments
- The Transformer also worked well on English constituency parsing, a task very different from translation.
- This proved the model's versatility: it's not limited to one domain.
📌 Key Takeaways (Day 5)
- Smart training schedule (warmup + gradual decay) stabilized learning.
- Dropout and label smoothing prevented overfitting and boosted generalization.
- Training was efficient: state-of-the-art results in about 12 hours (base model) to 3.5 days (big model) on 8 P100 GPUs.
- The Transformer demonstrated it could be a scalable, general-purpose architecture, not just for translation.
🧠 Final Thoughts (Day 5)
Day 5 tied everything together: the paper not only introduced a new architecture, but also showed how careful training strategies and regularization tricks made it both efficient and powerful. It's clear why the Transformer quickly became the foundation for today's NLP (and beyond).