Attention Really Is All You Need - The Encoder

Transformers · Deep Learning


A deep dive into the transformer architecture - the Encoder. From sequence models to multi-head attention and layer normalisation, all from first principles.

By Sai Saketh  ·  October 21, 2025  ·  10 min read

Introduction

In this article we're going to take a deep dive into the transformer architecture, based on the paper "Attention Is All You Need". Since it's quite a long paper with many important concepts, we'll split it into two parts: the Encoder in this first part, and the Decoder plus training and inference in the second.

For some context, "Attention Is All You Need" is a 2017 landmark research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al. It is considered a foundational paper in modern artificial intelligence, and a main contributor to the AI boom, as the transformer approach has become the main architecture of a wide variety of AI, such as large language models.

Now let's dive in!

The Rise of Sequence Models

Before Transformers revolutionized deep learning, models like Recurrent Neural Networks (RNNs) and LSTMs dominated the field. They processed data sequentially, which meant each step depended on the previous one.

How RNNs work

RNNs take one token at a time, combine it with the hidden state from the previous step, and produce an output token. However, this sequential dependency introduces several critical limitations.

Figure 1 - Recurrent Neural Networks (RNN)
(Diagram: an RNN unrolled over time steps 1…N — at each step the cell takes input Xₜ and the previous hidden state, starting from state <0>, and produces output Yₜ.)
  • 1
    Slow Computation. Like running a loop, each step waits for the previous one. Training long sequences becomes computationally expensive.
  • 2
    Vanishing / Exploding Gradients. When the same gradients are multiplied repeatedly through many steps, they may become tiny (vanishing), so updates barely happen, or they may become huge (exploding), causing weights to overshoot during training.
  • 3
    Long-term Dependencies. As information moves through many time steps, the model forgets older context. When translating long sentences, the beginning often gets "lost."
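To make the sequential dependency behind these limitations concrete, here is a minimal numpy sketch of a vanilla RNN unrolled over a sequence. The tanh cell and the shapes are illustrative assumptions, not something prescribed by the paper.

```python
import numpy as np

def rnn_forward(X, W_xh, W_hh, W_hy, h0):
    """Run a vanilla RNN over a sequence, one token at a time."""
    h = h0
    outputs = []
    for x_t in X:                                 # each step waits for the previous one
        h = np.tanh(x_t @ W_xh + h @ W_hh)        # new state depends on the old state
        outputs.append(h @ W_hy)                  # output for this time step
    return np.stack(outputs), h

# Illustrative shapes
seq_len, d_in, d_hidden, d_out = 5, 8, 16, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_in))
W_xh = rng.normal(size=(d_in, d_hidden))
W_hh = rng.normal(size=(d_hidden, d_hidden))
W_hy = rng.normal(size=(d_hidden, d_out))
Y, h_last = rnn_forward(X, W_xh, W_hh, W_hy, np.zeros(d_hidden))
print(Y.shape)   # (5, 4): one output per time step
```

Notice that step t cannot start until step t−1 has produced its hidden state — exactly the bottleneck described in point 1 above.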

These problems set the stage for something faster, parallel, and more powerful: the Transformer.

Enter the Transformer

Introduced in "Attention Is All You Need" (Vaswani et al., 2017), the Transformer eliminated recurrence entirely. Instead, it uses attention mechanisms that let each word "look at" all other words in parallel.

The architecture has two main components: the Encoder, which reads and understands the input, and the Decoder, which generates the output (like a translation). Let's look at each component in detail.

Figure 2 - The Transformer Architecture (Vaswani et al., 2017)
(Diagram: the encoder stack — Input Embedding → Positional Encoding → Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm — feeding the decoder stack — Output Embedding (outputs shifted right) → Positional Encoding → Masked Multi-Head Attention → Add & Norm → Multi-Head Attention over the encoder output → Add & Norm → Feed Forward → Add & Norm → Linear → Softmax → Output Probabilities.)

Input Embedder

The input embedder is the first step in the Transformer's encoder. It converts raw text into numerical representations that the model can understand. The input embedder follows these steps:

  • 1
    Tokenization. The original sentence is split into smaller units called tokens. These could be words, subwords, or even characters.
  • 2
    Numerical Mapping. Each token is assigned a unique number based on its position in the vocabulary. For example, the word "love" might correspond to the index 3452.
  • 3
    Embedding Representation. Each token index is then mapped to a dense embedding vector of size 512, often denoted as d_model = 512. This vector captures the semantic meaning of the token. Words with similar meanings will have similar vector representations.
  • 4
    Trainable Parameters. These embedding vectors are not fixed; they are learned and updated during training based on the loss function. This allows the model to refine how it represents each word's meaning in context.
Figure 3 - The Flow of the Input Embedder
(Diagram: the sentence "YOUR CAT IS A LOVELY CAT" is mapped to token IDs, e.g. CAT → 6587, and each ID to an embedding vector of size 512; both occurrences of "CAT" share the same ID and the same embedding. We define d_model = 512, the size of each word's embedding vector.)

In short, the input embedder transforms discrete words into continuous numerical vectors that carry rich contextual meaning, preparing them for further processing in the Transformer.
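To make the four steps above concrete, here is a minimal numpy sketch of an input embedder. The whitespace tokenizer, the toy vocabulary, and the random embedding matrix are illustrative assumptions — real models use learned subword tokenizers and train the embedding matrix end-to-end.

```python
import numpy as np

d_model = 512
vocab = {"your": 0, "cat": 1, "is": 2, "a": 3, "lovely": 4}   # toy vocabulary

# The embedding matrix is a trainable parameter: one d_model vector per token ID.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_model))

def embed(sentence):
    tokens = sentence.lower().split()        # 1. tokenization (toy: whitespace split)
    ids = [vocab[t] for t in tokens]         # 2. numerical mapping to vocabulary indices
    return embedding_matrix[ids]             # 3. embedding lookup -> (seq, d_model)

x = embed("your cat is a lovely cat")
print(x.shape)   # (6, 512); both occurrences of "cat" share the same row
```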

Positional Encoder

Positional encoding injects position information so the model can tell which tokens are near each other and which are far apart. For example, in the sentence "your cat is a lovely cat", a human can tell that "your" is "far" from "lovely" but the Transformer cannot on its own. Enter the positional encoder.

We create a positional vector for each position in the sequence. Each positional vector has the same dimensionality as the token embedding d_model = 512. To compute the positional vectors we use the classic trigonometric method. For position pos and dimension index i (0-based):

Figure 4 - The Computation of the Positional Encoding Tuple
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

This uses sin on even indices and cos on odd indices at different frequencies, producing smooth, continuous patterns across positions.

Figure 5 - The Positional Encoder in Action: Same PE Reused Across Sentences
(Diagram: the same positional encodings PE(0, ·), PE(1, ·), PE(2, ·) are applied to "YOUR CAT IS" in sentence 1 and to "I LOVE YOU" in sentence 2 — positional encodings are computed once and reused for every sentence during training and inference.)
Figure 6 - The Positional Encoder in Action: Embedding + Position = Encoder Input
(Diagram: for the sentence "YOUR CAT IS A NICE CAT", each word's embedding (d_model = 512) is added element-wise to the positional encoding for its position (computed once, reused always), giving the encoder input, also of size 512.)

Now we add the positional vectors to the embeddings. For each token, take its learned embedding vector (size d_model) and add the positional vector for that token's position. The result is a combined vector that carries both meaning (embedding) and position (positional encoding). The final encoder input is the sum of both vectors.
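Here is a minimal numpy sketch of the positional encoding formula and of adding it to the embeddings; the random matrix standing in for the learned embeddings is an illustrative assumption.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000, i / d_model)    # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even indices get sin
    pe[:, 1::2] = np.cos(angles)                   # odd indices get cos
    return pe

seq_len, d_model = 6, 512
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(seq_len, d_model))   # stand-in for learned embeddings
encoder_input = embeddings + positional_encoding(seq_len, d_model)  # meaning + position
print(encoder_input.shape)   # (6, 512)
```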

Why trigonometric functions? Trigonometric functions like cos and sin naturally represent a continuous pattern that the model can recognise, so relative positions are easier to perceive. By looking at the plot of these functions, we can observe a regular pattern, so we can hypothesize that the model will see it too.

Multi-Headed Attention

Before going into multi-headed attention, let's look at self-attention to better appreciate what multi-head attention adds.

Self-Attention

Self-attention is the mechanism that lets every token in a sequence relate to every other token, so each token's representation becomes aware of the whole sentence. In this simple case we consider the sequence length seq = 6 and d_model = d_k = 512. Each token "looks at" the other tokens, measures how relevant they are, and then gathers information from them to produce a new representation.

Figure 7 - Formula for Self-Attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

We form three matrices Q (queries), K (keys), and V (values) from the input embeddings, each of shape (6, 512). We multiply Q by Kᵀ to get pairwise similarity scores, divide by √d_k to stabilize gradients, apply softmax so each row sums to 1, then multiply by V.

Figure 8 - Self-Attention: Q × Kᵀ / √512 → Softmax → (6,6) Attention Matrix
(Diagram: softmax(Q × Kᵀ / √512) produces a (6, 6) attention matrix for "YOUR CAT IS A NICE CAT" in which each row sums to 1; the values shown are random, for illustration only.)
Figure 9 - Multiply Softmax Result by V to Get Attention Output (6, 512)
(Diagram: the (6, 6) softmax matrix is multiplied by V of shape (6, 512) to give the attention output of shape (6, 512).)

Each entry in the resulting matrix scores how strongly one word relates to another. Work through the matrix multiplication by hand and you'll see why each row captures not only the meaning (from the embedding) and the position in the sentence (from the positional encoding), but also each word's interaction with all the other words.
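A minimal numpy sketch of scaled dot-product self-attention as described above; for simplicity it uses Q = K = V = the encoder input directly (the learned projections appear with multi-head attention below), and the random input is an illustrative stand-in.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) pairwise similarity scores
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ V, weights                   # (seq, d_k) output, (seq, seq) weights

seq, d_model = 6, 512
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d_model))               # encoder input for 6 tokens
out, weights = self_attention(x, x, x)            # simplest case: Q = K = V = x
print(out.shape, weights.shape, weights.sum(axis=1))   # (6, 512) (6, 6) rows sum to 1
```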

Multi-Head Attention

Instead of computing a single attention operation, we split Q, K, V into h heads and run attention in each smaller subspace in parallel. Each head sees the full sentence but attends to a different slice of each word's embedding, because a word may mean different things depending on context.

Figure 10 - Multi-Head Attention: Project → Split → h Heads → Concat → W_O → MH-A
(Diagram: the input (seq, d_model) is projected by W_Q, W_K, W_V — each (d_model, d_model) — into Q', K', V'; each is split into h heads of size d_k; every head computes softmax(QKᵀ / √d_k)V; the heads are concatenated into (seq, h·d_k) and projected by W_O into the MH-A output of shape (seq, d_model). In formulas: head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V) and MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W^O.)
  • 1
    Start with the input matrix of shape (seq, d_model). Make three copies of it to serve as the Q, K, and V inputs; the original input is also kept for the residual (Add & Norm) connection later.
  • 2
    Project Q, K, V by multiplying with learned matrices W_Q, W_K, W_V each of shape (d_model, d_model), producing Q', K', V' of shape (seq, d_model).
  • 3
    Split into h heads along the embedding dimension. Each head has dimension d_k = d_model / h. This yields Q₁, Q₂, . . . , Q_h (and similarly K_i, V_i), each of shape (seq, d_k).
  • 4
    For each head i, compute scaled dot-product attention: softmax((Q_i K_i^T) / √d_k) V_i, producing Head_i of shape (seq, d_k). Each head encodes a different aspect of token relationships.
  • 5
    Concatenate all h heads to form Concat(head₁,…,head_h) of shape (seq, h * d_k). Since h * d_k = d_model, this restores shape (seq, d_model).
  • 6
    Apply the final linear projection W_O of shape (d_model, d_model) to produce the MH-A output of shape (seq, d_model). (A numpy sketch of these six steps follows below.)
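Here is a minimal numpy sketch of the six steps above, assuming h = 8 heads and random matrices standing in for the learned projections W_Q, W_K, W_V, W_O.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_Q, W_K, W_V, W_O, h):
    seq, d_model = x.shape
    d_k = d_model // h
    # 1-2. project the input into Q', K', V', each of shape (seq, d_model)
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    # 3. split into h heads of shape (h, seq, d_k)
    split = lambda M: M.reshape(seq, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # 4. scaled dot-product attention in every head
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, seq, seq)
    heads = softmax(scores) @ Vh                             # (h, seq, d_k)
    # 5. concatenate heads back to (seq, h * d_k) = (seq, d_model)
    concat = heads.transpose(1, 0, 2).reshape(seq, h * d_k)
    # 6. final projection with W_O
    return concat @ W_O

seq, d_model, h = 6, 512, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(x, W_Q, W_K, W_V, W_O, h).shape)   # (6, 512)
```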

Add and Norm

Moving past the multi-head attention block, we reach the Add & Norm layer. It performs two key functions:

Add (Residual Connection): The output of a sublayer (such as Multi-Head Attention or Feed Forward Network) is added to its original input. This helps prevent loss of information and mitigates the vanishing gradient problem, allowing gradients to flow easily through deep networks.

Norm (Layer Normalization): After adding the residual connection, the result is normalized across its features to ensure that the data has a stable distribution. This makes learning faster and more reliable by keeping activations within a consistent range.

Figure 11 - Calculating the Mean and Variance of Each Item Independently
(Diagram: a batch of 3 items — each item's mean μ and variance σ² are computed independently over its own features, not across the batch.)
  • 1
    Take the input x and the sublayer output sublayer(x). Add them to form the residual: y = x + sublayer(x).
  • 2
    Compute the mean μ and variance σ² of all features for each token embedding in y.
  • 3
    Normalize: ŷ = (y − μ) / √(σ² + ε), where ε is a small constant preventing division by zero.
  • 4
    Scale and shift using learnable parameters: LayerNorm(yᵢ) = γ · ŷᵢ + β. γ (gamma) and β (beta) introduce controlled fluctuations, so the network can learn to stress when one word is more intense than another.
Why add gamma and beta? Normalization forces every token's features to have zero mean and unit variance, which may be too restrictive for the network. The network learns to tune these two parameters to reintroduce fluctuations when necessary, essentially using them to stress when one word is more intense than another.
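A minimal numpy sketch of Add & Norm as described above, with γ initialised to ones and β to zeros (a common starting point) and random matrices standing in for the sublayer input and output.

```python
import numpy as np

def layer_norm(y, gamma, beta, eps=1e-6):
    """Normalize each token's features independently, then scale and shift."""
    mu = y.mean(axis=-1, keepdims=True)        # per-token mean over the 512 features
    var = y.var(axis=-1, keepdims=True)        # per-token variance (not across the batch)
    y_hat = (y - mu) / np.sqrt(var + eps)      # ε prevents division by zero
    return gamma * y_hat + beta                # learnable scale (γ) and shift (β)

seq, d_model = 6, 512
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, d_model))            # sublayer input
sublayer_out = rng.normal(size=(seq, d_model)) # e.g. multi-head attention output
gamma, beta = np.ones(d_model), np.zeros(d_model)

out = layer_norm(x + sublayer_out, gamma, beta)   # Add (residual), then Norm
print(out.shape, out.mean(axis=-1)[:2])           # (6, 512), per-token means ≈ 0
```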

That's all for Part 1!

It was a long one, so take some time to re-read it and fully grasp the beautiful design of the encoder. We'll pick up right where we left off with the Decoder in the next part.

Credit to Umar Jamil for his fantastic video on the "Attention Is All You Need" paper and for the accompanying visuals. His explanation made these concepts much easier to understand. Be sure to check out his video yourself!