Paper: https://arxiv.org/abs/1706.03762
Published in NeurIPS 2017
Day 3 - Multi-Head Attention
Motivation
- A single attention head may capture only one type of relationship.
- Multi-Head Attention lets the model look at different subspaces and positions simultaneously.
Mechanism
- Instead of one attention function, the model projects Queries, Keys, and Values multiple times with different learned weights.
- Each head runs Scaled Dot-Product Attention independently.
- The outputs of all heads are concatenated and then linearly transformed again.
Dimensions
- In the paper: d_model = 512 with h = 8 parallel heads.
- Each head projects Q, K, and V down to d_k = d_v = 64 dimensions, which keeps the total computation cost similar to a single attention over the full dimensionality (sketched below).
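To make the mechanism concrete, here is a minimal NumPy sketch of the projections, per-head Scaled Dot-Product Attention, concatenation, and final linear layer. The weight matrices are random placeholders standing in for learned parameters, and the shapes follow the paper's d_model = 512, h = 8, d_k = d_v = 64; treat it as an illustration, not the reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)              # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over the keys
    return weights @ V                                          # (seq_len, d_v)

def multi_head_attention(x, W_q, W_k, W_v, W_o, h=8):
    """x: (seq_len, d_model); W_q/W_k/W_v: lists of h matrices (d_model, d_k);
    W_o: (h * d_v, d_model)."""
    heads = []
    for i in range(h):
        Q, K, V = x @ W_q[i], x @ W_k[i], x @ W_v[i]            # project to 64 dims per head
        heads.append(scaled_dot_product_attention(Q, K, V))     # each head attends independently
    concat = np.concatenate(heads, axis=-1)                     # (seq_len, h * d_v) = (seq_len, 512)
    return concat @ W_o                                         # final linear projection

# Toy usage with random placeholder weights.
d_model, d_k, h, seq_len = 512, 64, 8, 10
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
W_q = [rng.standard_normal((d_model, d_k)) * 0.01 for _ in range(h)]
W_k = [rng.standard_normal((d_model, d_k)) * 0.01 for _ in range(h)]
W_v = [rng.standard_normal((d_model, d_k)) * 0.01 for _ in range(h)]
W_o = rng.standard_normal((h * d_k, d_model)) * 0.01
print(multi_head_attention(x, W_q, W_k, W_v, W_o, h).shape)    # (10, 512)
```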
Benefits
- Diversity: each head can focus on different cues (syntax, semantics, positional).
- Efficiency: because each head works in a reduced dimension, the total cost stays comparable to single-head attention over the full dimensionality.
- Expressiveness: combining multiple heads enriches the final representation.
Key Takeaways (Day 3)
- Multi-Head Attention = many attentions in parallel.
- Captures multiple dependencies at once.
- Core component of the Transformer's success.
Day 4 - Feed-Forward Networks & Positional Encoding
Feed-Forward Networks
- Each encoder/decoder layer has a fully connected feed-forward network after attention.
- Applied independently at every position, but parameters are shared across all positions.
- In the paper: input/output dimension d_model = 512, inner dimension d_ff = 2048.
- Purpose: adds a non-linear transformation at every position, giving the model more capacity (a small sketch follows this list).
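A quick sketch of that position-wise FFN, following the paper's FFN(x) = max(0, xW1 + b1)W2 + b2 with d_model = 512 and d_ff = 2048. The weights below are random placeholders rather than trained parameters.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """x: (seq_len, d_model). The same W1, b1, W2, b2 are applied at every position."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU non-linearity, (seq_len, d_ff)
    return hidden @ W2 + b2                 # back to (seq_len, d_model)

d_model, d_ff, seq_len = 512, 2048, 10
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.01, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.01, np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)   # (10, 512)
```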
Positional Encoding
- Transformers don't have recurrence or convolution, so they need a way to know token order.
- Solution: add positional encodings to embeddings.
- Defined with sine and cosine functions of different frequencies.
- Provides both absolute position information and relative-distance cues (a small sketch of the formula follows this list).
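The definition in the paper is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Here is a small NumPy sketch of that table, my own illustration of the formula rather than code from the paper.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions get cosine
    return pe

# The result is added to the token embeddings before the first layer.
# Nothing is learned, so positions beyond those seen in training are
# computed exactly the same way.
pe = sinusoidal_positional_encoding(max_len=100)
print(pe.shape)   # (100, 512)
```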
Why Sinusoidal?
- Generalizes to sequences longer than those seen during training.
- Smoothly represents positions and distances.
- Simple, parameter-free design that works effectively.
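One way to see the relative-distance and generalization points (my own note, expanding the paper's remark that PE(pos + k) is a linear function of PE(pos)): with per-dimension frequency \(\omega_i = 10000^{-2i/d_{model}}\), the angle-addition identities give

```latex
\begin{aligned}
\sin\bigl(\omega_i (pos + k)\bigr) &= \sin(\omega_i\, pos)\cos(\omega_i k) + \cos(\omega_i\, pos)\sin(\omega_i k),\\
\cos\bigl(\omega_i (pos + k)\bigr) &= \cos(\omega_i\, pos)\cos(\omega_i k) - \sin(\omega_i\, pos)\sin(\omega_i k),
\end{aligned}
```

so the encoding at position pos + k is a fixed rotation, depending only on k, of the encoding at pos. This is the property the paper points to for attending by relative offsets and for extrapolating to longer sequences.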
Key Takeaways (Day 4)
- FFN: boosts model power with extra transformation at each position.
- Positional encoding: injects order into an attention-only model.
- Sinusoidal design: elegant and effective for long sequences.
Final Thoughts (Day 3 & 4)
- Multi-Head Attention enriches relationships and is the backbone of the Transformer.
- Feed-Forward Networks add essential non-linearity.
- Positional Encoding elegantly solves the order problem.
Next, I'll study the training strategies and optimization details described in the paper.