Attention Really Is All You Need - The Encoder

Transformers · Deep Learning


A deep dive into the transformer architecture - the Encoder. From sequence models to multi-head attention and layer normalisation, all from first principles.

By Sai Saketh  ·  October 21, 2025  ·  10 min read

Introduction

In this article we're going to take a deep dive into the transformer architecture, based on the paper "Attention Is All You Need". Since it's quite a long paper with many important concepts, we'll split it into two parts: the Encoder in this first part, and the Decoder plus training and inference in the second.

For some context, "Attention Is All You Need" is a 2017 landmark research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al. It is considered a foundational paper in modern artificial intelligence, and a main contributor to the AI boom, as the transformer approach has become the main architecture of a wide variety of AI, such as large language models.

Now let's dive in!

The Rise of Sequence Models

Before Transformers revolutionized deep learning, models like Recurrent Neural Networks (RNNs) and LSTMs dominated the field. They processed data sequentially, which meant each step depended on the previous one.

How RNNs work

RNNs take one token at a time, combine it with the hidden state from the previous step, and produce an output token. However, this sequential dependency introduces several critical limitations.

Figure 1 - Recurrent Neural Networks (RNN)
(Diagram: an RNN unrolled over time steps 1…N — at each step the cell takes input Xₜ and the previous hidden state, starting from state <0>, and produces output Yₜ.)
  • 1
    Slow Computation. Like running a loop, each step waits for the previous one. Training long sequences becomes computationally expensive.
  • 2
    Vanishing / Exploding Gradients. When the same gradients are multiplied repeatedly through many steps, they may become tiny (vanishing), so updates barely happen, or they may become huge (exploding), causing weights to overshoot during training.
  • 3
    Long-term Dependencies. As information moves through many time steps, the model forgets older context. When translating long sentences, the beginning often gets "lost."
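To make the sequential dependency behind these limitations concrete, here is a minimal numpy sketch of a vanilla RNN unrolled over a sequence. The tanh cell and the shapes are illustrative assumptions, not something prescribed by the paper.

```python
import numpy as np

def rnn_forward(X, W_xh, W_hh, W_hy, h0):
    """Run a vanilla RNN over a sequence, one token at a time."""
    h = h0
    outputs = []
    for x_t in X:                                 # each step waits for the previous one
        h = np.tanh(x_t @ W_xh + h @ W_hh)        # new state depends on the old state
        outputs.append(h @ W_hy)                  # output for this time step
    return np.stack(outputs), h

# Illustrative shapes
seq_len, d_in, d_hidden, d_out = 5, 8, 16, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_in))
W_xh = rng.normal(size=(d_in, d_hidden))
W_hh = rng.normal(size=(d_hidden, d_hidden))
W_hy = rng.normal(size=(d_hidden, d_out))
Y, h_last = rnn_forward(X, W_xh, W_hh, W_hy, np.zeros(d_hidden))
print(Y.shape)   # (5, 4): one output per time step
```

Notice that step t cannot start until step t−1 has produced its hidden state — exactly the bottleneck described in point 1 above.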

These problems set the stage for something faster, parallel, and more powerful: the Transformer.

Enter the Transformer

Introduced in "Attention Is All You Need" (Vaswani et al., 2017), the Transformer eliminated recurrence entirely. Instead, it uses attention mechanisms that let each word "look at" all other words in parallel.

The architecture has two main components: the Encoder, which reads and understands the input, and the Decoder, which generates the output (like a translation). Let's look at each component in detail.

Figure 2 - The Transformer Architecture (Vaswani et al., 2017)
(Diagram: the encoder stack — Input Embedding → Positional Encoding → Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm — feeding the decoder stack — Output Embedding (outputs shifted right) → Positional Encoding → Masked Multi-Head Attention → Add & Norm → Multi-Head Attention over the encoder output → Add & Norm → Feed Forward → Add & Norm → Linear → Softmax → Output Probabilities.)

Input Embedder

The input embedder is the first step in the Transformer's encoder. It converts raw text into numerical representations that the model can understand. The input embedder follows these steps:

  • 1
    Tokenization. The original sentence is split into smaller units called tokens. These could be words, subwords, or even characters.
  • 2
    Numerical Mapping. Each token is assigned a unique number based on its position in the vocabulary. For example, the word "love" might correspond to the index 3452.
  • 3
    Embedding Representation. Each token index is then mapped to a dense embedding vector of size 512, often denoted as d_model = 512. This vector captures the semantic meaning of the token. Words with similar meanings will have similar vector representations.
  • 4
    Trainable Parameters. These embedding vectors are not fixed; they are learned and updated during training based on the loss function. This allows the model to refine how it represents each word's meaning in context.
Figure 3 - The Flow of the Input Embedder
(Diagram: the sentence "YOUR CAT IS A LOVELY CAT" is mapped to token IDs, e.g. CAT → 6587, and each ID to an embedding vector of size 512; both occurrences of "CAT" share the same ID and the same embedding. We define d_model = 512, the size of each word's embedding vector.)

In short, the input embedder transforms discrete words into continuous numerical vectors that carry rich contextual meaning, preparing them for further processing in the Transformer.
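To make the four steps above concrete, here is a minimal numpy sketch of an input embedder. The whitespace tokenizer, the toy vocabulary, and the random embedding matrix are illustrative assumptions — real models use learned subword tokenizers and train the embedding matrix end-to-end.

```python
import numpy as np

d_model = 512
vocab = {"your": 0, "cat": 1, "is": 2, "a": 3, "lovely": 4}   # toy vocabulary

# The embedding matrix is a trainable parameter: one d_model vector per token ID.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_model))

def embed(sentence):
    tokens = sentence.lower().split()        # 1. tokenization (toy: whitespace split)
    ids = [vocab[t] for t in tokens]         # 2. numerical mapping to vocabulary indices
    return embedding_matrix[ids]             # 3. embedding lookup -> (seq, d_model)

x = embed("your cat is a lovely cat")
print(x.shape)   # (6, 512); both occurrences of "cat" share the same row
```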

Positional Encoder

Positional encoding injects position information so the model can tell which tokens are near each other and which are far apart. For example, in the sentence "your cat is a lovely cat", a human can tell that "your" is "far" from "lovely" but the Transformer cannot on its own. Enter the positional encoder.

We create a positional vector for each position in the sequence. Each positional vector has the same dimensionality as the token embedding d_model = 512. To compute the positional vectors we use the classic trigonometric method. For position pos and dimension index i (0-based):

Figure 4 - The Computation of the Positional Encoding Tuple
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

This uses sin on even indices and cos on odd indices at different frequencies, producing smooth, continuous patterns across positions.

Figure 5 - The Positional Encoder in Action: Same PE Reused Across Sentences
(Diagram: the same positional encodings PE(0, ·), PE(1, ·), PE(2, ·) are applied to "YOUR CAT IS" in sentence 1 and to "I LOVE YOU" in sentence 2 — positional encodings are computed once and reused for every sentence during training and inference.)
Figure 6 - The Positional Encoder in Action: Embedding + Position = Encoder Input
(Diagram: for the sentence "YOUR CAT IS A NICE CAT", each word's embedding (d_model = 512) is added element-wise to the positional encoding for its position (computed once, reused always), giving the encoder input, also of size 512.)

Now we add the positional vectors to the embeddings. For each token, take its learned embedding vector (size d_model) and add the positional vector for that token's position. The result is a combined vector that carries both meaning (embedding) and position (positional encoding). The final encoder input is the sum of both vectors.
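Here is a minimal numpy sketch of the positional encoding formula and of adding it to the embeddings; the random matrix standing in for the learned embeddings is an illustrative assumption.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000, i / d_model)    # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even indices get sin
    pe[:, 1::2] = np.cos(angles)                   # odd indices get cos
    return pe

seq_len, d_model = 6, 512
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(seq_len, d_model))   # stand-in for learned embeddings
encoder_input = embeddings + positional_encoding(seq_len, d_model)  # meaning + position
print(encoder_input.shape)   # (6, 512)
```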

Why trigonometric functions? Trigonometric functions like cos and sin naturally represent a continuous pattern that the model can recognise, so relative positions are easier to perceive. By looking at the plot of these functions, we can observe a regular pattern, so we can hypothesize that the model will see it too.

Multi-Headed Attention

Before going into multi-headed attention, let's look at self-attention to better appreciate what multi-head attention adds.

Self-Attention

Self-attention is the mechanism that lets every token in a sequence relate to every other token, so each token's representation becomes aware of the whole sentence. In this simple case we consider the sequence length seq = 6 and d_model = d_k = 512. Each token "looks at" the other tokens, measures how relevant they are, and then gathers information from them to produce a new representation.

Figure 7 - Formula for Self-Attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

We form three matrices Q (queries), K (keys), and V (values) from the input embeddings, each of shape (6, 512). We multiply Q by Kᵀ to get pairwise similarity scores, divide by √d_k to stabilize gradients, apply softmax so each row sums to 1, then multiply by V.

Figure 8 - Self-Attention: Q × Kᵀ / √512 → Softmax → (6,6) Attention Matrix
(Diagram: softmax(Q × Kᵀ / √512) produces a (6, 6) attention matrix for "YOUR CAT IS A NICE CAT" in which each row sums to 1; the values shown are random, for illustration only.)
Figure 9 - Multiply Softmax Result by V to Get Attention Output (6, 512)
(Diagram: the (6, 6) softmax matrix is multiplied by V of shape (6, 512) to give the attention output of shape (6, 512).)

Each entry in the resulting matrix scores how strongly one word relates to another. Work through the matrix multiplication by hand and you'll see why each row captures not only the meaning (from the embedding) and the position in the sentence (from the positional encoding), but also each word's interaction with all the other words.
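A minimal numpy sketch of scaled dot-product self-attention as described above; for simplicity it uses Q = K = V = the encoder input directly (the learned projections appear with multi-head attention below), and the random input is an illustrative stand-in.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) pairwise similarity scores
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ V, weights                   # (seq, d_k) output, (seq, seq) weights

seq, d_model = 6, 512
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d_model))               # encoder input for 6 tokens
out, weights = self_attention(x, x, x)            # simplest case: Q = K = V = x
print(out.shape, weights.shape, weights.sum(axis=1))   # (6, 512) (6, 6) rows sum to 1
```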

Multi-Head Attention

Instead of computing a single attention operation, we split Q, K, V into h heads and run attention in each smaller subspace in parallel. Each head sees the full sentence but attends to a different slice of each word's embedding, because a word may mean different things depending on context.

Figure 10 - Multi-Head Attention: Project → Split → h Heads → Concat → W_O → MH-A
(Diagram: the input (seq, d_model) is projected by W_Q, W_K, W_V — each (d_model, d_model) — into Q', K', V'; each is split into h heads of size d_k; every head computes softmax(QKᵀ / √d_k)V; the heads are concatenated into (seq, h·d_k) and projected by W_O into the MH-A output of shape (seq, d_model). In formulas: head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V) and MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W^O.)
  • 1
    Start with the input matrix of shape (seq, d_model). Make three copies of it to serve as the Q, K, and V inputs; the original input is also kept for the residual (Add & Norm) connection later.
  • 2
    Project Q, K, V by multiplying with learned matrices W_Q, W_K, W_V each of shape (d_model, d_model), producing Q', K', V' of shape (seq, d_model).
  • 3
    Split into h heads along the embedding dimension. Each head has dimension d_k = d_model / h. This yields Q₁, Q₂, . . . , Q_h (and similarly K_i, V_i), each of shape (seq, d_k).
  • 4
    For each head i, compute scaled dot-product attention: softmax((Q_i K_i^T) / √d_k) V_i, producing Head_i of shape (seq, d_k). Each head encodes a different aspect of token relationships.
  • 5
    Concatenate all h heads to form Concat(head₁,…,head_h) of shape (seq, h * d_k). Since h * d_k = d_model, this restores shape (seq, d_model).
  • 6
    Apply the final linear projection W_O of shape (d_model, d_model) to produce the MH-A output of shape (seq, d_model). (A numpy sketch of these six steps follows below.)
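Here is a minimal numpy sketch of the six steps above, assuming h = 8 heads and random matrices standing in for the learned projections W_Q, W_K, W_V, W_O.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_Q, W_K, W_V, W_O, h):
    seq, d_model = x.shape
    d_k = d_model // h
    # 1-2. project the input into Q', K', V', each of shape (seq, d_model)
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    # 3. split into h heads of shape (h, seq, d_k)
    split = lambda M: M.reshape(seq, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # 4. scaled dot-product attention in every head
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, seq, seq)
    heads = softmax(scores) @ Vh                             # (h, seq, d_k)
    # 5. concatenate heads back to (seq, h * d_k) = (seq, d_model)
    concat = heads.transpose(1, 0, 2).reshape(seq, h * d_k)
    # 6. final projection with W_O
    return concat @ W_O

seq, d_model, h = 6, 512, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(x, W_Q, W_K, W_V, W_O, h).shape)   # (6, 512)
```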

Add and Norm

Moving past the multi-head attention block, we reach the Add & Norm layer. It performs two key functions:

Add (Residual Connection): The output of a sublayer (such as Multi-Head Attention or Feed Forward Network) is added to its original input. This helps prevent loss of information and mitigates the vanishing gradient problem, allowing gradients to flow easily through deep networks.

Norm (Layer Normalization): After adding the residual connection, the result is normalized across its features to ensure that the data has a stable distribution. This makes learning faster and more reliable by keeping activations within a consistent range.

Figure 11 - Calculating the Mean and Variance of Each Item Independently
(Diagram: a batch of 3 items — each item's mean μ and variance σ² are computed independently over its own features, not across the batch.)
  • 1
    Take the input x and the sublayer output sublayer(x). Add them to form the residual: y = x + sublayer(x).
  • 2
    Compute the mean μ and variance σ² of all features for each token embedding in y.
  • 3
    Normalize: ŷ = (y − μ) / √(σ² + ε), where ε is a small constant preventing division by zero.
  • 4
    Scale and shift using learnable parameters: LayerNorm(yᵢ) = γ · ŷᵢ + β. γ (gamma) and β (beta) introduce controlled fluctuations, so the network can learn to stress when one word is more intense than another.
Why add gamma and beta? Normalization forces every token's features to have zero mean and unit variance, which may be too restrictive for the network. The network learns to tune these two parameters to reintroduce fluctuations when necessary, essentially using them to stress when one word is more intense than another.
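A minimal numpy sketch of Add & Norm as described above, with γ initialised to ones and β to zeros (a common starting point) and random matrices standing in for the sublayer input and output.

```python
import numpy as np

def layer_norm(y, gamma, beta, eps=1e-6):
    """Normalize each token's features independently, then scale and shift."""
    mu = y.mean(axis=-1, keepdims=True)        # per-token mean over the 512 features
    var = y.var(axis=-1, keepdims=True)        # per-token variance (not across the batch)
    y_hat = (y - mu) / np.sqrt(var + eps)      # ε prevents division by zero
    return gamma * y_hat + beta                # learnable scale (γ) and shift (β)

seq, d_model = 6, 512
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d_model))            # sublayer input
sublayer_out = rng.normal(size=(seq, d_model)) # e.g. multi-head attention output
gamma, beta = np.ones(d_model), np.zeros(d_model)

out = layer_norm(x + sublayer_out, gamma, beta)   # Add (residual), then Norm
print(out.shape, out.mean(axis=-1)[:2])           # (6, 512), per-token means ≈ 0
```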

That's all for Part 1!

It was a long one, so take some time to re-read it and fully grasp the beautiful design of the encoder. We'll pick up right where we left off with the Decoder in the next part.

Credit to Umar Jamil for his fantastic video on the "Attention Is All You Need" paper and for the accompanying visuals. His explanation made these concepts much easier to understand. Be sure to check out his video yourself!