Transformer Architecture
Attention, transformers, self-attention, multi-head attention, and positional encoding.
The Attention Mechanism
The Attention mechanism revolutionized NLP by allowing models to look over the entire input sequence dynamically at every step of generating an output, rather than relying on a single static context vector.
How Attention Solved the Bottleneck
During translation (e.g., from English to French), when the decoder wants to output the French word for "apple," it doesn't just look at the static context vector. Instead, it looks back at all the hidden states of the encoder, assigns a weight (attention score) to each English word, and heavily focuses its "attention" specifically on the English word "apple".
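The weighting step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the exact method of any particular translation model: the encoder states, the decoder state, and the use of a plain dot product as the scoring function are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical toy data: 4 English source words, hidden size 3
source_words = ["I", "ate", "an", "apple"]
encoder_states = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])
# Hypothetical decoder state at the step that should emit the French word for "apple"
decoder_state = np.array([2.0, 2.0, 2.0])

scores = encoder_states @ decoder_state   # one score per source word (dot product is an assumption)
weights = softmax(scores)                 # attention weights, which sum to 1
context = weights @ encoder_states        # weighted mix of all encoder states

focus = source_words[int(np.argmax(weights))]
print(focus, weights.round(3))
```

The context vector is rebuilt at every decoding step, so each output word gets its own mix of encoder states instead of one fixed summary.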
Alignment
Attention provides an automatic, implicit alignment between source and target languages without any explicit linguistic rules.
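Longer Context
Because the decoder can look directly at every encoder state, performance no longer drops drastically on long sentences. The usable context is still limited by the length of the input, so it is longer rather than literally infinite.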
Transformers Intro
What is a Transformer?
Introduced in 2017 by Google researchers in the paper "Attention Is All You Need", the Transformer architecture fundamentally changed NLP by replacing sequential processing (RNNs/LSTMs) with parallel processing via Self-Attention.
Level 1 — The Core Concept
The Transformer consists of an Encoder (to understand input) and a Decoder (to generate output). Unlike RNNs that look at words one by one, Transformers look at all words simultaneously.
Key Advantage: Parallelization
Because words are processed in parallel, Transformers can be trained on massive datasets using modern GPUs much faster than previous models.
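To make the contrast concrete, here is a rough sketch under simplifying assumptions: the recurrence below is a toy stand-in for an RNN cell, and the attention scores use the raw inputs directly with no learned weights. The point is only the shape of the computation: one has a loop that cannot be skipped, the other is a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
x = rng.normal(size=(seq_len, d))      # hypothetical word vectors
w = rng.normal(size=(d, d)) * 0.1      # hypothetical recurrent weights

# Recurrent style: step t needs the hidden state from step t-1,
# so the positions must be visited one after another.
h = np.zeros(d)
rnn_states = []
for t in range(seq_len):
    h = np.tanh(x[t] + h @ w)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Attention style: every pairwise score comes out of one matrix product,
# with no dependence between positions, so hardware can compute them together.
scores = x @ x.T / np.sqrt(d)
```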
Level 2 — Architecture Breakdown
A standard Transformer stacks several identical layers. Each encoder layer has two main sub-layers (decoder layers add a third, which attends over the encoder output):
- Multi-Head Self-Attention: Allows the model to focus on different parts of the sentence at once.
- Feed-Forward Neural Network: Processes the information extracted by the attention layer.
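The two sub-layers can be sketched as a tiny encoder layer in NumPy. Several things here are simplifications or assumptions: attention uses a single head, layer normalization is left out, the residual connections follow the original paper rather than anything stated above, and all sizes and helper names (`make_params`, `encoder_layer`) are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product attention (multi-head omitted for brevity)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v

def feed_forward(x, w1, b1, w2, b2):
    # The same two-layer network with ReLU is applied to every position independently
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def encoder_layer(x, p):
    # Sub-layer 1: self-attention, added back to the input (residual connection)
    x = x + self_attention(x, p["w_q"], p["w_k"], p["w_v"])
    # Sub-layer 2: feed-forward network, again with a residual connection
    return x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"])

def make_params(rng, d_model, d_ff):
    # Hypothetical random initialization; a real model learns these values
    return {
        "w_q": rng.normal(size=(d_model, d_model)) * 0.1,
        "w_k": rng.normal(size=(d_model, d_model)) * 0.1,
        "w_v": rng.normal(size=(d_model, d_model)) * 0.1,
        "w1": rng.normal(size=(d_model, d_ff)) * 0.1,
        "b1": np.zeros(d_ff),
        "w2": rng.normal(size=(d_ff, d_model)) * 0.1,
        "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5
layers = [make_params(rng, d_model, d_ff) for _ in range(2)]  # same structure, separate weights

out = rng.normal(size=(seq_len, d_model))
for p in layers:
    out = encoder_layer(out, p)
```

Because every layer maps a `(seq_len, d_model)` array to another array of the same shape, layers can be stacked as deep as needed.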
Level 3 — Impact on NLP
The Transformer paved the way for "Foundation Models" like BERT and GPT. It largely addressed the problem of "long-range dependencies", where RNNs tend to forget the beginning of a long sentence by the time they reach the end.
from transformers import pipeline
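# The pipeline API is the easiest way to use the Hugging Face Transformers library
# With no model specified, it downloads a default sentiment model on first use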
classifier = pipeline("sentiment-analysis")
result = classifier("Transformers are the backbone of modern AI.")
print(result)
Self-Attention
Self-Attention is the mechanism that allows a model to weigh the importance of different words in a sentence relative to a specific word. It allows the model to say, "To understand the word 'it' in this sentence, I need to look closely at the word 'robot'."
Level 1 — The QKV Intuition
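Every word is mathematically represented by three vectors:
- Query (Q): what this word is looking for in the other words.
- Key (K): what this word offers for other words to match against.
- Value (V): the information this word passes along once it has been matched.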
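In the standard formulation, the query, key, and value vectors come from multiplying each word's embedding by three learned matrices. The sketch below uses random matrices in place of learned ones, and the sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

x = rng.normal(size=(seq_len, d_model))   # hypothetical word embeddings, one row per word

# Stand-ins for the learned projection matrices
w_q = rng.normal(size=(d_model, d_k))
w_k = rng.normal(size=(d_model, d_k))
w_v = rng.normal(size=(d_model, d_k))

q = x @ w_q   # what each word is looking for
k = x @ w_k   # what each word can be matched on
v = x @ w_v   # what each word contributes once matched
```

Each word keeps one row in each of `q`, `k`, and `v`, so the same embedding plays three different roles.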
Level 2 — The Scaled Dot-Product Math
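The model calculates a score for every pair of words by taking the dot product of Q and K, divides by the square root of the key dimension dₖ (scaling), and applies a softmax so that each word's scores become weights that sum to 1. Those weights are then used to take a weighted sum of the V vectors.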
Attention(Q, K, V) = softmax(QKᵀ / √dₖ)V
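The division by √dₖ is easy to overlook, so here is a small experiment that shows what it does. With random vectors (an assumption made purely for illustration), unscaled dot products grow with the dimension and push the softmax toward putting almost all weight on one key; scaling keeps the distribution softer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=d_k)            # one hypothetical query
keys = rng.normal(size=(10, d_k))   # ten hypothetical keys

raw = keys @ q                      # unscaled scores, spread grows with d_k
scaled = raw / np.sqrt(d_k)         # scaled scores, spread stays near 1

w_raw = softmax(raw)
w_scaled = softmax(scaled)

print("largest weight without scaling:", w_raw.max())
print("largest weight with scaling:   ", w_scaled.max())
```

A nearly one-hot softmax has very small gradients, which is the usual motivation given for the scaling term.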
Level 3 — Contextual Richness
By the end of this process, each word's vector is updated to include information from the other words in the sentence, not only the adjacent ones. This "contextualized embedding" is what makes Transformers so much more powerful than static embeddings like Word2Vec.
import torch
import torch.nn.functional as F
# Simulated Q, K, V
q = torch.randn(1, 10, 64)
k = torch.randn(1, 10, 64)
v = torch.randn(1, 10, 64)
# Attention scores
scores = torch.matmul(q, k.transpose(-2, -1)) / (64**0.5)
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, v)
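To see what "contextualized" means in practice, the sketch below feeds the same word through self-attention in two different sentences. It is deliberately stripped down: Q, K, and V are all set to the raw embeddings (no learned projections), and the embeddings are random placeholders rather than trained vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Simplification: Q = K = V = x
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

rng = np.random.default_rng(0)
d = 16
# Hypothetical static embeddings: one fixed vector per word, as in Word2Vec
vocab = {w: rng.normal(size=d) for w in ["the", "river", "bank", "money"]}

sent_a = np.stack([vocab[w] for w in ["the", "river", "bank"]])
sent_b = np.stack([vocab[w] for w in ["the", "money", "bank"]])

bank_a = self_attention(sent_a)[2]   # "bank" after attending to "the river bank"
bank_b = self_attention(sent_b)[2]   # "bank" after attending to "the money bank"

print(np.array_equal(sent_a[2], sent_b[2]))  # same static input vector
print(np.allclose(bank_a, bank_b))           # different contextualized output
```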
Multi-Head Attention
In practice, a single self-attention head is often too limiting. Multi-Head Attention allows the model to run multiple attention processes in parallel, each free to focus on a different type of relationship.
Level 1 — Why Multiple Heads?
One head might focus on grammar (subject-verb agreement), while another focuses on semantics (word meaning), and a third focuses on references (pronouns).
Level 2 — Concatenation and Projection
The results from all "heads" are concatenated into one long vector and then passed through a final linear layer to bring it back to the original dimension.
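The split, attend, concatenate, and project steps can be sketched in NumPy as follows. The weights are random stand-ins for learned parameters, biases are omitted, and the function name `multi_head_attention` is just a label for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, seq_len, seq_len)
    heads = softmax(scores) @ v                           # (num_heads, seq_len, d_head)

    # Concatenate the heads back into one vector per word
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    # Final linear layer mixes the heads and restores the original dimension
    return concat @ w_o

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 512, 8, 10
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)
```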
Multi-Head vs Single-Head
Single-head attention has to blend all relationships into one set of attention weights. Multi-head attention allows the model to maintain multiple distinct "interpretations" of the sentence simultaneously.
Level 3 — Parameter Efficiency
Despite having multiple heads, we don't increase the total number of parameters significantly because we split the original dimension between the heads (e.g., 512 total dim / 8 heads = 64 dim per head).
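The arithmetic can be checked directly. The helper below counts weights and biases for the Q, K, V, and output projections, assuming each head projects from the full model dimension down to its own slice; the function name is made up for this example.

```python
def attention_param_count(d_model, num_heads):
    d_head = d_model // num_heads
    # Q, K, V projections: each head maps d_model -> d_head (weights plus biases)
    per_head = 3 * (d_model * d_head + d_head)
    # Output projection maps the concatenated heads back to d_model
    output = (num_heads * d_head) * d_model + d_model
    return num_heads * per_head + output

single = attention_param_count(512, 1)
multi = attention_param_count(512, 8)
print(single, multi)
```

Under these assumptions the two counts come out identical, because eight heads of width 64 use the same projection matrices as one head of width 512, just sliced differently.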
import torch
from torch import nn
# Example in PyTorch
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8)
# query, key, value embeddings, shaped (seq_len, batch, embed_dim) because batch_first defaults to False
q = torch.randn(10, 1, 512)
k = torch.randn(10, 1, 512)
v = torch.randn(10, 1, 512)
attn_output, _ = mha(q, k, v)
Positional Encoding
Since Transformers process all words at once, they have no inherent concept of order. Without position information, a Transformer sees "Dog bites man" and "Man bites dog" as the same set of words. Positional Encoding fixes this by adding a unique signature to each word's position.
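This order-blindness can be demonstrated with a small experiment. The sketch uses a bare self-attention step with Q = K = V equal to the embeddings (a simplification) and random placeholder embeddings; reversing the sentence only reorders the output rows, and any order-insensitive summary such as the mean is unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Simplification: no learned projections and no positional information
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=8) for w in ["dog", "bites", "man"]}  # hypothetical embeddings

a = np.stack([emb[w] for w in ["dog", "bites", "man"]])
b = np.stack([emb[w] for w in ["man", "bites", "dog"]])

out_a = self_attention(a)
out_b = self_attention(b)

# Each word receives the same output vector in both sentences; only the row order differs
same_rows = np.allclose(out_a, out_b[::-1])
# A pooled sentence representation cannot tell the two sentences apart
same_pooled = np.allclose(out_a.mean(axis=0), out_b.mean(axis=0))
print(same_rows, same_pooled)
```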
Level 1 — Adding the Signature
Instead of relying on a sequence (1st word, 2nd word), we add a specific numeric vector to the word embedding that represents its position. This allows the model to know exactly where each word sits in the sentence.
Level 2 — Sinusoidal Functions
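The original Transformer used sine and cosine functions of different frequencies to generate these positions:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
The authors chose this form hoping it would let the model generalize to sentence lengths it hasn't seen during training, and because it makes relative positions easy to express (for any fixed offset k, the encoding of position p+k can be expressed as a linear function of the encoding of position p).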
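The linear relationship between positions p and p+k can be checked numerically. For one sine/cosine pair with frequency ω, the angle-addition identities give a 2×2 matrix that depends only on the offset k; the sketch below builds that matrix and verifies it for many positions. The values of `d_model`, the pair index `i`, and `k` are arbitrary choices for the example.

```python
import numpy as np

def encoding_pair(pos, omega):
    # The (sin, cos) pair that one frequency contributes to the encoding
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

def shift_matrix(k, omega):
    # Depends only on the offset k, never on the position itself
    return np.array([
        [np.cos(omega * k), np.sin(omega * k)],
        [-np.sin(omega * k), np.cos(omega * k)],
    ])

d_model = 16
i = 3                                     # index of the sine/cosine pair
omega = 1.0 / 10000 ** (2 * i / d_model)  # frequency of that pair
k = 5                                     # fixed offset

m = shift_matrix(k, omega)
ok = all(
    np.allclose(m @ encoding_pair(p, omega), encoding_pair(p + k, omega))
    for p in range(50)
)
print(ok)
```

Because the same matrix works at every position, "five words to the right" looks the same to the model wherever it occurs in the sentence.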
Level 3 — Learned vs Fixed Encodings
While the original Transformer used a fixed math formula, many modern models (like BERT) use Learned Positional Embeddings, where the model treats "Position 1" as just another word it needs to learn the meaning of during training.
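A learned positional embedding is essentially a lookup table with one row per position. The sketch below shows only the lookup and the addition; in a real model the table would be a trainable parameter (for example an `nn.Embedding` in PyTorch), and the sizes here are arbitrary rather than those of any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d_model = 32, 16   # hypothetical sizes

# One row per position; random here, updated by training in a real model
position_table = rng.normal(scale=0.02, size=(max_len, d_model))

token_embeddings = rng.normal(size=(6, d_model))   # a hypothetical 6-word sentence
positions = np.arange(token_embeddings.shape[0])   # 0, 1, ..., 5

x = token_embeddings + position_table[positions]
print(x.shape)
```

One consequence of this design is that the table has no row for positions at or beyond `max_len`, so the maximum sequence length is fixed when the model is built.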
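import numpy as np
def get_positional_encoding(pos, i, d_model):
    # Dimensions 2j and 2j+1 share one frequency, so the exponent uses i//2
    angle = pos / np.power(10000, (2 * (i//2)) / d_model)
    # Even dimensions use sine, odd dimensions use cosine
    return np.sin(angle) if i % 2 == 0 else np.cos(angle)
# Returns one value per (position, dimension) pair; a position's full vector is these values for i = 0 .. d_model-1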
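In practice the whole table is usually computed at once and added to the word embeddings. Here is a vectorized sketch of the same formula; the function name `positional_encoding_matrix` and the sizes are illustrative, and the embeddings are random placeholders.

```python
import numpy as np

def positional_encoding_matrix(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]   # column of positions, shape (seq_len, 1)
    i = np.arange(d_model)[None, :]     # row of dimension indices, shape (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    # Sine on even dimensions, cosine on odd dimensions
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

seq_len, d_model = 50, 16
pe = positional_encoding_matrix(seq_len, d_model)

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(seq_len, d_model))   # hypothetical word embeddings
x = embeddings + pe                                # position-aware input to the first layer
print(pe.shape)
```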