Positional Encoding Tutorial

Positional Encoding

How Transformers understand word order without recurrence.

Since Transformers process all words in parallel, they have no inherent notion of order: to a pure Transformer, "Dog bites man" and "Man bites dog" look identical. Positional Encoding fixes this by adding a unique positional signature to each word's embedding.

Level 1 — Adding the Signature

Instead of processing words one at a time (1st word, 2nd word, ...), we add a position-specific numeric vector to each word's embedding. This way the model knows exactly where each word sits in the sentence, even though it sees them all at once.
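As a toy illustration (the numbers below are made up, not the real sinusoidal values), adding a position vector makes the same word distinguishable at different positions:

```python
import numpy as np

dog = np.array([0.2, -0.1, 0.5, 0.3])        # toy embedding for "dog"

# Hypothetical position vectors for positions 0 and 2 (illustrative values)
pe = {0: np.array([0.0, 1.0, 0.0, 1.0]),
      2: np.array([0.91, -0.42, 0.02, 1.0])}

# The model's input is embedding + position vector, so "dog" as the
# first word and "dog" as the third word are now distinct vectors.
assert not np.allclose(dog + pe[0], dog + pe[2])
```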

Level 2 — Sinusoidal Functions

The original Transformer generated these position vectors with fixed sinusoidal functions (sines and cosines at a range of frequencies). Because the formula is defined for any position, the model can generalize to sequence lengths it hasn't seen during training.

Why Sine/Cosine? It creates a smooth, periodic pattern in which relative positions are easy to compute: the encoding at position p+k can be expressed as a linear function (a fixed rotation) of the encoding at position p, for any offset k.
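A small NumPy check of this linearity claim (the vector size, positions, and variable names below are illustrative). Each (sin, cos) pair shares one frequency, and the angle-addition identities make the shift by k a 2x2 rotation that does not depend on p:

```python
import numpy as np

d_model, p, k = 8, 5, 3
pair = np.arange(d_model // 2)
omega = 1.0 / 10000 ** (2 * pair / d_model)   # one frequency per (sin, cos) pair

def pe_vec(pos):
    # Interleaved layout: [sin(w0*pos), cos(w0*pos), sin(w1*pos), ...]
    out = np.empty(d_model)
    out[0::2] = np.sin(pos * omega)
    out[1::2] = np.cos(pos * omega)
    return out

# Rotate each (sin, cos) pair of PE(p) by the angle k*omega ...
base = pe_vec(p)
c, s = np.cos(k * omega), np.sin(k * omega)
shifted = np.empty(d_model)
shifted[0::2] = c * base[0::2] + s * base[1::2]
shifted[1::2] = -s * base[0::2] + c * base[1::2]

# ... and you land exactly on PE(p+k): a linear map, independent of p.
assert np.allclose(shifted, pe_vec(p + k))
```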

Level 3 — Learned vs Fixed Encodings

While the original Transformer used a fixed formula, many modern models (such as BERT) use Learned Positional Embeddings: the model treats "Position 1" like just another token whose vector it must learn during training. The trade-off is that a learned table cannot extrapolate beyond the maximum length seen during training.
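A minimal sketch of the learned alternative, using a plain NumPy lookup table in place of a framework embedding layer (the sizes and initialization scale are illustrative; in a real model this table is a trainable parameter updated by backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d_model = 512, 8

# Learned positional embeddings: a parameter matrix with one row per
# position, initialized randomly and trained like any other weight.
pos_table = rng.normal(scale=0.02, size=(max_len, d_model))

tokens = rng.normal(size=(3, d_model))       # toy 3-token sentence
x = tokens + pos_table[np.arange(3)]         # look up row p, add to token p
assert x.shape == (3, d_model)
# Note: positions >= max_len have no row at all, so unlike the sinusoidal
# formula, this table cannot handle longer sequences than it was built for.
```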

Sinusoidal Positional Encoding (NumPy)
import numpy as np

def get_positional_encoding(pos, i, d_model):
    # Dimension pairs (2i, 2i+1) share one frequency; frequencies fall
    # geometrically from 1 down to 1/10000 across the depth of the vector.
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # Even dimensions take the sine, odd dimensions the cosine.
    return np.sin(angle) if i % 2 == 0 else np.cos(angle)

# Generates a unique value for every (position, dimension) pair
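A quick usage sketch that builds the full encoding matrix for a short toy sequence (the function is repeated so the snippet runs standalone; the sequence length and depth are arbitrary):

```python
import numpy as np

def get_positional_encoding(pos, i, d_model):
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.sin(angle) if i % 2 == 0 else np.cos(angle)

seq_len, d_model = 4, 6
pe = np.array([[get_positional_encoding(p, i, d_model)
                for i in range(d_model)]
               for p in range(seq_len)])
assert pe.shape == (seq_len, d_model)
# Position 0 is always [sin(0), cos(0), ...] = [0, 1, 0, 1, 0, 1]
assert np.allclose(pe[0], [0, 1, 0, 1, 0, 1])
```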