Transformer Architecture | Nikhil Learn Hub

Attention Mechanism

The attention mechanism changed NLP by letting a model consult the entire input sequence at every step of generating an output, rather than relying on a single fixed-length context vector.

How Attention Solved the Bottleneck

Earlier encoder-decoder RNNs compressed the whole source sentence into one fixed-length context vector, which became a bottleneck for long inputs. With attention, during translation (e.g., from English to French), when the decoder is about to output the French word for "apple," it does not rely on that single vector. Instead, it looks back at all the hidden states of the encoder, assigns a weight (attention score) to each English word, and places most of its weight on the English word "apple".
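The weighting step described above can be sketched in a few lines of NumPy. This is a toy illustration, not a trained model: the encoder states are hand-made orthogonal vectors, the decoder state is simply set equal to the state for "apple", and dot-product scoring is assumed (one common choice among several scoring functions).

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

source_words = ["I", "ate", "an", "apple"]

# Hypothetical encoder hidden states: one toy vector per source word.
encoder_states = 3.0 * np.eye(len(source_words))

# Hypothetical decoder state, chosen to match the encoder state for "apple".
decoder_state = encoder_states[3]

# One score per source word (dot product), normalized into attention weights.
scores = encoder_states @ decoder_state
weights = softmax(scores)

# The context vector is a weighted average of all encoder states.
context = weights @ encoder_states

for word, w in zip(source_words, weights):
    print(f"{word:>6}: {w:.4f}")
```

In this toy setup nearly all of the weight lands on "apple", so the context vector is almost identical to that word's encoder state.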

Alignment

Attention provides an automatic, implicit alignment between source and target languages without any explicit linguistic rules.

Longer Context

Because the decoder can look directly at every encoder state, performance no longer drops sharply on long sentences.

Transformers Intro

What is a Transformer?

Introduced in 2017 by Google researchers in the paper "Attention Is All You Need", the Transformer architecture fundamentally changed NLP by replacing sequential processing (RNNs/LSTMs) with parallel processing via Self-Attention.

Level 1: The Core Concept

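The original Transformer consists of an Encoder (to understand the input) and a Decoder (to generate the output). Unlike RNNs, which process words one at a time, a Transformer processes all words of the input in parallel. At generation time the decoder still emits output tokens one by one.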

Key Advantage: Parallelization

Because words are processed in parallel, Transformers can be trained on massive datasets using modern GPUs much faster than previous models.
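A rough way to see the difference is to compare the shape of the computation. In the sketch below (toy sizes, random weights, and a deliberately simplified recurrent cell, all of which are assumptions for illustration), the recurrent version needs a loop because each step depends on the previous one, while the attention version is a handful of matrix products over the whole sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
x = rng.normal(size=(seq_len, d))    # toy token embeddings
W = rng.normal(size=(d, d)) * 0.1    # hypothetical weight matrix

# RNN-style: step t needs the hidden state from step t-1, so the loop
# cannot be parallelized across positions.
h = np.zeros(d)
rnn_states = []
for t in range(seq_len):
    h = np.tanh(x[t] @ W + h @ W)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)

# Attention-style: every position is handled by the same matrix products,
# with no dependence on a previous step's result.
scores = (x @ W) @ (x @ W).T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_states = weights @ x

print(rnn_states.shape, attn_states.shape)
```

Matrix products like these map well onto GPUs, which is where the training speed-up comes from.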

Level 2: Architecture Breakdown

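A standard Transformer stacks several identical layers. Each encoder layer has two main sub-layers, each wrapped in a residual connection followed by layer normalization (decoder layers add a third sub-layer that attends to the encoder output):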

Multi-Head Self-Attention: Allows the model to focus on different parts of the sentence at once.
Feed-Forward Neural Network: Processes the information extracted by the attention layer.
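The two sub-layers can be sketched as a single NumPy function. This is a simplified illustration under stated assumptions: one attention head instead of several, no biases, no learned scale/shift in the normalization, and random placeholder weights in a hypothetical `params` dictionary. The residual connections and layer normalization follow the layout of the original paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and shift are omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_layer(x, p):
    # Sub-layer 1: (single-head) self-attention with a residual connection.
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    x = layer_norm(x + attn @ p["Wo"])
    # Sub-layer 2: position-wise feed-forward network with a residual connection.
    hidden = np.maximum(0.0, x @ p["W1"])   # ReLU
    return layer_norm(x + hidden @ p["W2"])

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5
shapes = {
    "Wq": (d_model, d_model), "Wk": (d_model, d_model),
    "Wv": (d_model, d_model), "Wo": (d_model, d_model),
    "W1": (d_model, d_ff), "W2": (d_ff, d_model),
}
params = {name: rng.normal(size=shape) * 0.1 for name, shape in shapes.items()}

x = rng.normal(size=(seq_len, d_model))
out = encoder_layer(x, params)
print(out.shape)  # (5, 16)
```

Because the output has the same shape as the input, layers like this can be stacked one after another.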

Level 3: Impact on NLP

The Transformer paved the way for "Foundation Models" like BERT and GPT. It largely addressed the problem of "long-range dependencies", where RNNs tended to forget the beginning of a long sentence by the time they reached the end, because attention connects any two positions directly.

Hugging Face Transformers Usage

from transformers import pipeline

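# The pipeline API is the easiest way to use Transformers.
# With no model specified, a default sentiment model is downloaded on first use.
classifier = pipeline("sentiment-analysis")
result = classifier("Transformers are the backbone of modern AI.")
print(result)  # a list with one dict containing a "label" and a "score"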

Self-Attention

Self-Attention is the mechanism that allows a model to weigh the importance of different words in a sentence relative to a specific word. It allows the model to say, "To understand the word 'it' in this sentence, I need to look closely at the word 'robot'."

Level 1: The QKV Intuition

Each word's embedding is projected, through three learned weight matrices, into three vectors:

Query (Q): What I'm looking for

Key (K): What I can offer

Value (V): My actual information
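As a minimal sketch of where Q, K, and V come from: each is the word embedding multiplied by its own weight matrix. The names `W_q`, `W_k`, `W_v` and the toy sizes are assumptions for illustration, and the matrices here are random placeholders rather than trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

x = rng.normal(size=(seq_len, d_model))   # one embedding per word

# Hypothetical projection matrices; in a real model these are learned.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = x @ W_q   # what each word is looking for
K = x @ W_k   # what each word can offer
V = x @ W_v   # the information each word carries

print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```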

Level 2: The Scaled Dot-Product Math

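The model computes a score for every pair of words by taking the dot product of Q and K, divides by the square root of the key dimension dₖ (scaling), and applies a softmax so that each word's scores become weights that sum to 1. Those weights are then used to take a weighted sum of the value vectors V.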

Attention(Q, K, V) = softmax(QKᵀ / √dₖ)V
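The scaling term is easier to appreciate with numbers. The original paper motivates it by noting that dot products grow in magnitude with the dimension, which pushes the softmax into regions with very small gradients. The sketch below assumes query and key components are independent with unit variance (an idealization) and compares the spread of raw and scaled dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# Many random query/key pairs with unit-variance components.
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))

raw = (q * k).sum(axis=-1)       # unscaled dot products
scaled = raw / np.sqrt(d_k)      # scaled as in the formula above

print(raw.std())     # close to sqrt(64) = 8
print(scaled.std())  # close to 1
```

Dividing by the square root of the dimension keeps the scores in a similar range regardless of how large the dimension is.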

Level 3: Contextual Richness

By the end of this process, each word's vector is updated to include information from the other words in the sentence, not only its immediate neighbors. This "contextualized embedding" is what makes Transformers so much more powerful than static embeddings like Word2Vec.

PyTorch Attention Logic (Simulated)

import torch
import torch.nn.functional as F

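# Simulated Q, K, V with shape (batch, seq_len, d_k)
q = torch.randn(1, 10, 64)
k = torch.randn(1, 10, 64)
v = torch.randn(1, 10, 64)

# Scaled attention scores: (1, 10, 10), one row per query position
d_k = q.size(-1)
scores = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)
# Softmax over the last axis turns each row into weights that sum to 1
weights = F.softmax(scores, dim=-1)
# Weighted sum of the values: (1, 10, 64)
output = torch.matmul(weights, v)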

Multi-Head Attention

In practice, a single attention head is often too limiting. Multi-Head Attention lets the model run multiple attention processes in parallel, each free to focus on a different type of relationship.

Level 1: Why Multiple Heads?

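For example, one head might focus on grammar (subject-verb agreement), while another focuses on semantics (word meaning), and a third focuses on references (pronouns). This division is an intuition: heads are not assigned roles, and what each one learns emerges from training.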

Level 2: Concatenation and Projection

The results from all "heads" are concatenated into one long vector and then passed through a final linear layer to bring it back to the original dimension.
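The split, per-head attention, concatenation, and final projection can be sketched in NumPy as follows. The function name `multi_head_attention`, the weight names, and the random placeholder weights are assumptions for illustration; biases, masking, and dropout are left out.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ W_q), split(x @ W_k), split(x @ W_v)

    # Scaled dot-product attention, computed independently for every head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ v          # (num_heads, seq_len, d_head)

    # Concatenate the heads back into one vector per token, then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 512, 8, 10
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
x = rng.normal(size=(seq_len, d_model))

out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (10, 512)
```

The output has the same shape as the input, which is what "bring it back to the original dimension" refers to.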

Multi-Head vs Single-Head

A single head produces one set of attention weights per word, so different kinds of relationships get blended together. Multi-head attention lets the model maintain several distinct "interpretations" of the sentence simultaneously.

Level 3: Parameter Efficiency

Despite having multiple heads, the total number of parameters stays roughly the same, because the model dimension is split between the heads (e.g., 512 total dim / 8 heads = 64 dim per head).
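The arithmetic behind this can be checked directly. The helper below is a hypothetical counting function, assuming the usual layout of per-head Q, K, V projections plus one output projection, with biases ignored.

```python
def attention_weight_count(d_model, num_heads):
    # Each head projects d_model -> d_head for Q, K and V (biases ignored).
    d_head = d_model // num_heads
    per_head = 3 * d_model * d_head
    # The output projection maps the concatenated heads back to d_model.
    output_projection = (num_heads * d_head) * d_model
    return num_heads * per_head + output_projection

for heads in (1, 2, 8):
    print(heads, attention_weight_count(512, heads))
# 1 1048576
# 2 1048576
# 8 1048576
```

Under these assumptions the count is 4 x 512 x 512 regardless of the number of heads, as long as the head count divides the model dimension evenly.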

Multi-Head Attention Structure

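import torch
from torch import nn

# Example in PyTorch
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8)
# query, key, value embeddings with shape (seq_len, batch, embed_dim),
# which is the default layout (batch_first=False)
q = torch.randn(10, 1, 512)
k = torch.randn(10, 1, 512)
v = torch.randn(10, 1, 512)

# attn_output has the same shape as q: (10, 1, 512)
attn_output, _ = mha(q, k, v)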

Positional Encoding

Since Transformers process all words at once, they have no inherent concept of order. Without positional information, a Transformer sees "Dog bites man" and "Man bites dog" as the same set of words. Positional Encoding fixes this by adding a unique signature for each position to the word at that position.
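This order-blindness can be demonstrated with a small experiment. The sketch below uses a single self-attention step with random placeholder weights and toy embeddings (all assumptions for illustration) and no positional information. Each word ends up with the same output vector in both sentences, wherever it appears.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Toy embeddings for the three words; no positional information is added.
emb = {w: rng.normal(size=d) for w in ("dog", "bites", "man")}
dog_bites_man = np.stack([emb["dog"], emb["bites"], emb["man"]])
man_bites_dog = np.stack([emb["man"], emb["bites"], emb["dog"]])

out_a = self_attention(dog_bites_man, W_q, W_k, W_v)
out_b = self_attention(man_bites_dog, W_q, W_k, W_v)

# "dog" gets the same output vector in both sentences, wherever it appears.
print(np.allclose(out_a[0], out_b[2]))  # True
print(np.allclose(out_a[1], out_b[1]))  # True
```

The model therefore has nothing to tell the two sentences apart until position information is mixed into the embeddings.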

Level 1: Adding the Signature

Instead of relying on sequential processing (1st word, then 2nd word), we add to each word embedding a numeric vector that represents its position. This lets the model know where each word sits in the sentence.

Level 2: Sinusoidal Functions

The original Transformer used sine and cosine functions of different frequencies to generate these position vectors. The authors hypothesized that this may let the model generalize to sentence lengths it hasn't seen during training.

Why Sine/Cosine? It creates a smooth pattern from which the model can easily work out relative positions between words (e.g., for a fixed offset k, the encoding of position p+k can be expressed as a linear function of the encoding of position p).
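The linear relationship can be verified numerically. In the sketch below, `encode` and `shift_matrix` are hypothetical helper names, and the interleaved sine/cosine layout is assumed. The matrix is built from 2x2 rotation blocks that depend only on the offset k, not on the position p.

```python
import numpy as np

def encode(pos, d_model=8):
    # Interleaved layout: even indices hold sines, odd indices hold cosines.
    freqs = 1.0 / np.power(10000, np.arange(0, d_model, 2) / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

def shift_matrix(k, d_model=8):
    # Block-diagonal matrix of 2x2 rotations; it depends on the offset k only.
    freqs = 1.0 / np.power(10000, np.arange(0, d_model, 2) / d_model)
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M

p, k = 5, 3
print(np.allclose(shift_matrix(k) @ encode(p), encode(p + k)))  # True
```

The same matrix works for any starting position p, which follows from the angle-addition identities for sine and cosine.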

Level 3: Learned vs Fixed Encodings

While the original Transformer used a fixed formula, many modern models (like BERT) use Learned Positional Embeddings, where the model treats "Position 1" as just another word whose meaning it needs to learn during training.
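A learned positional embedding is essentially a second lookup table, indexed by position instead of by word. The sketch below uses random placeholder tables and made-up token ids (all assumptions for illustration); in a real model both tables would be updated during training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 100, 16, 8

# Two lookup tables; in a real model both are trained by gradient descent.
# Here they are random placeholders.
token_table = rng.normal(size=(vocab_size, d_model))
position_table = rng.normal(size=(max_len, d_model))

token_ids = np.array([12, 7, 42])        # hypothetical token ids
positions = np.arange(len(token_ids))    # 0, 1, 2

# Each token's input vector is its word embedding plus its position embedding.
x = token_table[token_ids] + position_table[positions]
print(x.shape)  # (3, 8)
```

One consequence of this design is that the table has a fixed number of rows, so the model has no embedding for positions beyond `max_len`.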

Sinusoidal Positional Encoding (Numpy)

import numpy as np

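def get_positional_encoding(pos, i, d_model):
    # pos: token position, i: dimension index, d_model: embedding size
    angle = pos / np.power(10000, (2 * (i//2)) / d_model)
    # Even dimensions use sine, odd dimensions use cosine
    return np.sin(angle) if i % 2 == 0 else np.cos(angle)

# Returns one scalar for every (position, dimension) pair;
# the d_model scalars for a position form that position's vector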
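The scalar helper above can also be written in vectorized form to build the whole encoding matrix at once. This is a sketch using the same formula; the function name `positional_encoding_matrix` and the random toy embeddings are assumptions for illustration.

```python
import numpy as np

def positional_encoding_matrix(max_len, d_model):
    pos = np.arange(max_len)[:, np.newaxis]    # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]      # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # Sine on even dimensions, cosine on odd dimensions.
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

pe = positional_encoding_matrix(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)

# Adding the encoding to (toy, random) word embeddings of the same shape.
embeddings = np.random.default_rng(0).normal(size=(50, 16))
x = embeddings + pe
```

Each row of `pe` is the position vector for one position, and adding it to the embedding matrix is the "signature" step described earlier.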