Understanding Transformers from Scratch
2025-01-20
A ground-up explanation of the Transformer architecture — the engine behind GPT, BERT, and modern AI — without hiding behind the math.
Every major AI breakthrough of the last few years — GPT-4, BERT, AlphaFold, even the attention layers inside Stable Diffusion — shares one thing: the Transformer architecture. Originally introduced in the 2017 paper Attention Is All You Need, it replaced recurrent networks for sequence tasks and then quietly took over almost everything else. This article walks through how it actually works.
The Problem Transformers Solve
Before Transformers, sequence modeling was dominated by RNNs (Recurrent Neural Networks) and LSTMs. These process tokens one at a time, left to right. This creates two problems:
- Sequential bottleneck — you cannot parallelise training because each step depends on the previous one.
- Long-range forgetting — information from the beginning of a long sequence gets diluted by the time you reach the end.
Transformers solve both by processing the entire sequence at once using a mechanism called self-attention.
Attention: The Core Idea
Attention asks: for each token in my sequence, how much should I care about every other token?
Each token is projected into three vectors:
- Q (Query) — what this token is looking for
- K (Key) — what this token offers to others
- V (Value) — the actual content this token contributes
The attention score between two tokens is the dot product of their Q and K vectors, scaled and softmaxed:
```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V)
```
If token 5 is the word "bank" and token 2 is "river", the model learns that these two should attend strongly to each other — so "bank" gets the geographical (riverbank) interpretation rather than the financial one, purely from context.
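As a quick sanity check, here are the same steps run on random tensors (the shapes are illustrative, not prescribed):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, d_k = 2, 5, 8  # batch, sequence length, head dimension
Q = torch.randn(B, T, d_k)
K = torch.randn(B, T, d_k)
V = torch.randn(B, T, d_k)

# Mirror the attention() function above, step by step
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)  # (B, T, T)
weights = F.softmax(scores, dim=-1)
out = torch.matmul(weights, V)                                # (B, T, d_k)

# Each row of weights is a probability distribution over tokens
assert torch.allclose(weights.sum(dim=-1), torch.ones(B, T))
assert out.shape == (B, T, d_k)
```

The softmax is what makes this a weighted average: every output token is a blend of the value vectors, weighted by how relevant each other token is.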
Multi-Head Attention
Running a single attention pass only captures one type of relationship. Multi-head attention runs several attention operations in parallel, each with different learned projections, then concatenates the results:
```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must divide evenly across heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, D = x.shape
        # Split d_model into heads: (B, T, D) -> (B, num_heads, T, d_k)
        Q = self.W_q(x).view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        out = attention(Q, K, V)  # all heads computed in parallel
        # Merge heads back: (B, num_heads, T, d_k) -> (B, T, D)
        out = out.transpose(1, 2).contiguous().view(B, T, D)
        return self.W_o(out)
```
One head might learn syntactic relationships, another semantic ones, another positional dependencies — they each specialise.
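PyTorch ships a ready-made version of this module, which makes for a quick shape check of the idea (the dimensions below are illustrative):

```python
import torch
import torch.nn as nn

# Built-in equivalent of the module above (different weights, same shapes)
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)  # (batch, seq_len, d_model)

# Self-attention: query, key, and value all come from the same sequence
out, attn_weights = mha(x, x, x)

assert out.shape == x.shape               # d_model is preserved
assert attn_weights.shape == (2, 10, 10)  # one weight per token pair
```

Note that the output has the same shape as the input — this is what lets Transformer layers stack arbitrarily deep.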
Positional Encoding
Attention has no built-in sense of order — it treats the sequence as a set. To reintroduce position, a positional encoding vector is added to each token embedding before the first layer:
```python
import math

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe
```
The sine/cosine pattern lets the model generalise to sequence lengths it has not seen during training.
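A quick check of the properties just described — the function is repeated here so the snippet runs standalone, and the sizes are illustrative:

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
assert pe.shape == (50, 16)
# Sine/cosine values stay in [-1, 1], so adding them does not swamp the embeddings
assert float(pe.abs().max()) <= 1.0
# Every position gets a distinct vector, which is how order is reintroduced
assert not torch.allclose(pe[0], pe[1])
```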
The Full Transformer Block
One Transformer layer stacks:
- Multi-head self-attention
- Add & LayerNorm (residual connection)
- Feed-forward network (two linear layers with a ReLU)
- Add & LayerNorm again
```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Residual connection around attention, then around the feed-forward net
        x = self.norm1(x + self.dropout(self.attn(x)))
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x
```
Stack 12 of these (BERT-base) or 96 (GPT-3) and you have a working language model backbone.
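PyTorch also ships this block ready-made, so the stacking is a one-liner. A rough sketch — the dimensions here are kept small for illustration (BERT-base uses d_model=768, 12 heads, ff_dim=3072, and 12 layers):

```python
import torch
import torch.nn as nn

# Built-in version of the block above, stacked four deep
layer = nn.TransformerEncoderLayer(d_model=64, nhead=8,
                                   dim_feedforward=256, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)

x = torch.randn(2, 16, 64)   # (batch, seq_len, d_model)
out = encoder(x)
assert out.shape == x.shape  # every layer preserves the (B, T, d_model) shape
```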
Encoder vs. Decoder
The original Transformer had two halves:
- Encoder (BERT-style) — attends to the full sequence in both directions. Great for understanding tasks like classification and named entity recognition.
- Decoder (GPT-style) — uses masked attention so each token can only see past tokens. Great for generation.
Models like T5 use both halves together for translation and summarisation.
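The masking that distinguishes the decoder is small enough to sketch directly. A minimal illustration with made-up tensor sizes, not a full decoder:

```python
import torch
import torch.nn.functional as F

T, d_k = 5, 8
# Lower-triangular mask: position i may attend to positions 0..i only
mask = torch.tril(torch.ones(T, T)).bool()

Q = torch.randn(1, T, d_k)
K = torch.randn(1, T, d_k)
V = torch.randn(1, T, d_k)

scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
scores = scores.masked_fill(~mask, float("-inf"))  # hide future tokens
weights = F.softmax(scores, dim=-1)

# The first token can only attend to itself
assert torch.allclose(weights[0, 0], torch.tensor([1.0, 0.0, 0.0, 0.0, 0.0]))
```

Setting masked scores to negative infinity means the softmax assigns them exactly zero weight, so no information leaks from the future during training.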
Why It Matters
The Transformer's genius is that self-attention is a general-purpose compute primitive. It works on text, images (Vision Transformer), audio, protein sequences, even game states. The architecture has barely changed since 2017 — the progress comes from scale, data, and training tricks on top of it.
Understanding this foundation makes everything else — fine-tuning, prompt engineering, retrieval-augmented generation — much more intuitive. You are not using a black box; you are using a known, principled structure.