Attention Really Is All You Need - The Encoder
A deep dive into the transformer architecture - the Encoder. From sequence models to multi-head attention and layer normalisation, all from first principles.
Introduction
In this article we're going to take a deep dive into the transformer architecture, based on the paper "Attention Is All You Need". Since it's quite a long paper with many important concepts, we're splitting it into two parts: the Encoder in this first part, and the Decoder plus training and inference in the second.
For some context, "Attention Is All You Need" is a 2017 landmark research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al. It is considered a foundational paper in modern artificial intelligence, and a main contributor to the AI boom, as the transformer approach has become the main architecture of a wide variety of AI, such as large language models.
Now let's dive in!
The Rise of Sequence Models
Before Transformers revolutionized deep learning, models like Recurrent Neural Networks (RNNs) and LSTMs dominated the field. They processed data sequentially, which meant each step depended on the previous one.
How RNNs work
RNNs take one token at a time, combine it with the hidden state from the previous step, and produce an output token. However, this sequential dependency introduces several critical limitations.
1. Slow Computation. Like running a loop, each step waits for the previous one. Training long sequences becomes computationally expensive.
2. Vanishing / Exploding Gradients. When the same gradients are multiplied repeatedly through many steps, they may become tiny (vanishing), so updates barely happen, or huge (exploding), causing weights to overshoot during training.
3. Long-term Dependencies. As information moves through many time steps, the model forgets older context. When translating long sentences, the beginning often gets "lost."
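To make the sequential dependency concrete, here is a minimal sketch of a single RNN step in NumPy. The dimensions, weight names, and random inputs are all illustrative, not from the original paper:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One RNN step: combine the current token with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy dimensions: 4-dim inputs, 8-dim hidden state, a sequence of 6 tokens.
rng = np.random.default_rng(0)
d_in, d_h, seq = 4, 8, 6
W_xh = rng.normal(size=(d_in, d_h)) * 0.1
W_hh = rng.normal(size=(d_h, d_h)) * 0.1
b_h = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.normal(size=(seq, d_in)):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # each step must wait for the previous one
```

The loop is the whole problem: step t cannot start until step t−1 finishes, so nothing can run in parallel across the sequence.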
These problems set the stage for something faster, parallel, and more powerful: the Transformer.
Enter the Transformer
Introduced in "Attention Is All You Need" (Vaswani et al., 2017), the Transformer eliminated recurrence entirely. Instead, it uses attention mechanisms that let each word "look at" all other words in parallel.
The architecture has two main components: the Encoder, which reads and understands the input, and the Decoder, which generates the output (like a translation). Let's look into each component in detail.
Input Embedder
The input embedder is the first step in the Transformer's encoder. It converts raw text into numerical representations that the model can understand. The input embedder follows these steps:
1. Tokenization. The original sentence is split into smaller units called tokens. These could be words, subwords, or even characters.
2. Numerical Mapping. Each token is assigned a unique number based on its position in the vocabulary. For example, the word "love" might correspond to the index 3452.
3. Embedding Representation. Each token index is then mapped to a dense embedding vector of size 512, often denoted as d_model = 512. This vector captures the semantic meaning of the token. Words with similar meanings will have similar vector representations.
4. Trainable Parameters. These embedding vectors are not fixed; they are learned and updated during training based on the loss function. This allows the model to refine how it represents each word's meaning in context.
In short, the input embedder transforms discrete words into continuous numerical vectors that carry rich contextual meaning, preparing them for further processing in the Transformer.
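The steps above can be sketched in NumPy. The tiny vocabulary, the random table, and the √d_model scaling (which the paper applies before adding positions) are shown for illustration only:

```python
import numpy as np

d_model = 512
vocab = {"i": 0, "love": 1, "my": 2, "cat": 3}  # toy vocabulary

# Trainable lookup table: one d_model-sized row per vocabulary entry.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = "i love my cat".split()       # 1. tokenization
ids = [vocab[t] for t in tokens]       # 2. numerical mapping
embeddings = embedding_table[ids]      # 3. embedding representation

# The paper scales embeddings by sqrt(d_model) before adding positions.
embeddings = embeddings * np.sqrt(d_model)
```

In a real model the table would be a learned parameter updated by backpropagation (step 4); here it is just random numbers standing in for those weights.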
Positional Encoder
Positional encoding injects position information so the model can tell which tokens are near each other and which are far apart. For example, in the sentence "your cat is a lovely cat", a human can tell that "your" is "far" from "lovely" but the Transformer cannot on its own. Enter the positional encoder.
We create a positional vector for each position in the sequence. Each positional vector has the same dimensionality as the token embedding, d_model = 512. To compute the positional vectors we use the classic trigonometric method. For position pos and dimension index i (0-based):

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

This uses sin on even indices and cos on odd indices at different frequencies, producing smooth, continuous patterns across positions.
Now we add positional vectors to embeddings. For each token, take its learned embedding vector (size d_model) and add the positional vector for that token's position. The result is a combined vector that contains both meaning (embedding) and position (positional encoding). The final encoder input is the sum of both vectors.
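As a minimal NumPy sketch, the whole positional-encoding table can be computed at once (the vectorized layout is an implementation choice, not from the paper):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) table of sinusoidal positional vectors."""
    pos = np.arange(seq_len)[:, None]       # (seq, 1) positions
    i = np.arange(0, d_model, 2)[None, :]   # even dimension indices 0, 2, 4, ...
    angle = pos / (10000 ** (i / d_model))  # one frequency per pair of dims
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)             # sin on even indices
    pe[:, 1::2] = np.cos(angle)             # cos on odd indices
    return pe

pe = positional_encoding(6, 512)  # e.g. for "your cat is a lovely cat"
# encoder input would then be: embeddings + pe
```

Because the table depends only on position and dimension, it is computed once and reused for every sentence; nothing here is learned.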
Multi-Headed Attention
Before going into multi-headed attention, let's look at self-attention to better appreciate what multi-head attention adds.
Self-Attention
Self-attention is the mechanism that lets every token in a sequence relate to every other token, so each token's representation becomes aware of the whole sentence. In this simple case we consider the sequence length seq = 6 and d_model = d_k = 512. Each token "looks at" the other tokens, measures how relevant they are, and then gathers information from them to produce a new representation.
We form three matrices Q (queries), K (keys), and V (values) from the input embeddings, each of shape (6, 512). We multiply Q by Kᵀ to get pairwise similarity scores, divide by √d_k to stabilize gradients, apply softmax so each row sums to 1, then multiply by V.
Each entry in the resulting matrix scores how strongly one word relates to another. Work through the matrix multiplication by hand and you'll see that each row captures not only the meaning (given in the embedding) and the position in the sentence (represented by the positional encoding), but also each word's interaction with all other words.
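The whole computation fits in a few lines of NumPy. The random input stands in for the embedded, position-encoded sentence; in this plain self-attention sketch Q, K, and V are all the same matrix:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq) pairwise similarity scores
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted mix of value vectors

seq, d_k = 6, 512
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d_k))      # stand-in for embeddings + positions
out = self_attention(x, x, x)        # Q = K = V = input
```

Row j of `weights` tells you how much token j attends to every token in the sentence, and `out[j]` is the corresponding blend of value vectors.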
Multi-Head Attention
Instead of computing a single attention operation, we split Q, K, V into h heads and run attention in each smaller subspace in parallel. Each head watches the full sentence but a different aspect of the embedding of each word - because a word may have different meanings depending on context.
1. Start with the input matrix of shape (seq, d_model). One copy is kept aside for the Add & Norm step later, and three copies form the Q, K, V inputs.
2. Project Q, K, V by multiplying with learned matrices W_Q, W_K, W_V, each of shape (d_model, d_model), producing Q', K', V' of shape (seq, d_model).
3. Split into h heads along the embedding dimension. Each head has dimension d_k = d_model / h. This yields Q₁, Q₂, …, Q_h (and similarly K_i, V_i), each of shape (seq, d_k).
4. For each head i, compute scaled dot-product attention: softmax((Q_i K_iᵀ) / √d_k) V_i, producing Head_i of shape (seq, d_k). Each head encodes a different aspect of token relationships.
5. Concatenate all h heads to form Concat(head₁, …, head_h) of shape (seq, h · d_k). Since h · d_k = d_model, this restores shape (seq, d_model).
6. Apply a final linear projection W_O of shape (d_model, d_model) to produce the multi-head attention output of shape (seq, d_model).
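The six steps above can be sketched in NumPy. The reshape/transpose trick for splitting heads and the small random weights are implementation choices for this illustration, not prescribed by the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_Q, W_K, W_V, W_O, h):
    seq, d_model = x.shape
    d_k = d_model // h
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V                      # step 2: (seq, d_model) each
    split = lambda M: M.reshape(seq, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)                # step 3: (h, seq, d_k)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)       # step 4: (h, seq, seq)
    heads = softmax(scores) @ Vh                             # step 4: (h, seq, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)  # step 5: (seq, h*d_k)
    return concat @ W_O                                      # step 6: (seq, d_model)

seq, d_model, h = 6, 512, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.05 for _ in range(4))
out = multi_head_attention(x, W_Q, W_K, W_V, W_O, h)
```

Note that every head sees the full sentence (all seq positions) but only a d_k-sized slice of each token's embedding, which is exactly what lets different heads specialize.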
Add and Norm
Moving on to the last part of the multi-head attention block, we reach the Add & Norm layer. It performs two key functions:
Add (Residual Connection): The output of a sublayer (such as Multi-Head Attention or Feed Forward Network) is added to its original input. This helps prevent loss of information and mitigates the vanishing gradient problem, allowing gradients to flow easily through deep networks.
Norm (Layer Normalization): After adding the residual connection, the result is normalized across its features to ensure that the data has a stable distribution. This makes learning faster and more reliable by keeping activations within a consistent range.
1. Take the input x and the sublayer output sublayer(x). Add them to form the residual: y = x + sublayer(x).
2. Compute the mean μ and variance σ² over all features of each token embedding in y.
3. Normalize: ŷ = (y − μ) / √(σ² + ε), where ε is a small constant preventing division by zero.
4. Scale and shift using learnable parameters: LayerNorm(yᵢ) = γ · ŷᵢ + β. γ (gamma) and β (beta) let the network learn to amplify or dampen features, so it can stress one word's signal more than another's.
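These four steps can be sketched in NumPy. The ε value and the identity initialization of γ and β are common conventions assumed here for illustration:

```python
import numpy as np

def layer_norm(y, gamma, beta, eps=1e-6):
    mu = y.mean(axis=-1, keepdims=True)   # step 2: mean over features, per token
    var = y.var(axis=-1, keepdims=True)   # step 2: variance over features
    y_hat = (y - mu) / np.sqrt(var + eps) # step 3: normalize
    return gamma * y_hat + beta           # step 4: learnable scale and shift

def add_and_norm(x, sublayer_out, gamma, beta):
    return layer_norm(x + sublayer_out, gamma, beta)  # step 1: residual, then norm

seq, d_model = 6, 512
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d_model))      # block input
sub = rng.normal(size=(seq, d_model))    # stand-in for attention/FFN output
gamma, beta = np.ones(d_model), np.zeros(d_model)
out = add_and_norm(x, sub, gamma, beta)
```

With γ = 1 and β = 0 every token vector comes out with zero mean and unit variance across its 512 features; training then moves γ and β away from the identity wherever rescaling helps.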
That's all for Part 1!
It was a long one, so take some time to re-read it and fully grasp the beautiful design of the encoder. We'll pick up right where we left off with the Decoder in the next part.
Credit to Umar Jamil for his fantastic video on the "Attention Is All You Need" paper and for the accompanying visuals. His explanation made these concepts much easier to understand. Be sure to check out his video yourself!