Positional Encoding
How Transformers understand word order without recurrence.
Since Transformers process all words at once, they have no inherent concept of order. To a pure Transformer, "Dog bites man" and "Man bites dog" are identical. Positional Encoding fixes this by adding a unique signature for each position to the corresponding word's embedding.
Level 1 — Adding the Signature
Instead of relying on processing order (1st word, 2nd word, and so on), we add a specific numeric vector to each word embedding that represents its position. This allows the model to know exactly where each word sits in the sentence.
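The idea above can be sketched in a few lines. The numbers here are made up for illustration; the point is simply that the model's input is the element-wise sum of the word embedding and a position vector of the same dimension.

```python
import numpy as np

# Toy 4-dimensional embedding for one word (values are illustrative).
word_embedding = np.array([0.2, -0.1, 0.5, 0.3])

# A hypothetical position vector for this word's slot in the sentence.
position_vector = np.array([0.0, 1.0, 0.0, 1.0])

# The model's actual input is the element-wise sum of the two.
model_input = word_embedding + position_vector
print(model_input)  # [0.2  0.9  0.5  1.3]
```

Because the sum happens per dimension, the position vector must have the same length as the word embedding (d_model).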
Level 2 — Sinusoidal Functions
The original Transformer used math based on Sines and Cosines to generate these positions. This allows the model to generalize to sentence lengths it hasn't seen during training, and it has a useful property: the encoding for position p+k can be expressed as a linear function of the encoding for position p, which makes relative offsets easy for the model to learn.
Level 3 — Learned vs Fixed Encodings
While the original Transformer used a fixed math formula, many modern models (like BERT) use Learned Positional Embeddings, where the model treats "Position 1" as just another word it needs to learn the meaning of during training.
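A minimal sketch of the learned approach, using NumPy rather than a real training framework (the names and sizes here are illustrative, not BERT's actual configuration): the model stores one trainable vector per position in a table, looks up the rows for the current sequence, and adds them to the token embeddings. In a real model the table would be updated by gradient descent.

```python
import numpy as np

# Illustrative sizes; a trainable table with one row per position.
max_len, d_model = 512, 8
rng = np.random.default_rng(0)
pos_table = rng.normal(scale=0.02, size=(max_len, d_model))  # learned weights

def add_learned_positions(token_embeddings):
    # Look up the position vector for each slot and add it in.
    seq_len = token_embeddings.shape[0]
    return token_embeddings + pos_table[:seq_len]

tokens = rng.normal(size=(3, d_model))  # 3 tokens, e.g. "Dog bites man"
out = add_learned_positions(tokens)
print(out.shape)  # (3, 8)
```

Note the trade-off this implies: a learned table is capped at max_len positions, whereas the sinusoidal formula can be evaluated for any position.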
import numpy as np

def get_positional_encoding(pos, i, d_model):
    # Even dimensions use sine, odd dimensions use cosine,
    # with wavelengths that grow geometrically across dimensions.
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.sin(angle) if i % 2 == 0 else np.cos(angle)

# Generates a unique value for every (position, dimension) pair
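Calling the function across all dimensions for each position yields the full encoding matrix, one row per position. A runnable sketch (the helper is repeated so the snippet stands alone; seq_len and d_model are arbitrary choices):

```python
import numpy as np

# The helper from above, repeated so this sketch runs standalone.
def get_positional_encoding(pos, i, d_model):
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.sin(angle) if i % 2 == 0 else np.cos(angle)

d_model, seq_len = 16, 10
# One row per position, one column per embedding dimension.
pe = np.array([[get_positional_encoding(p, i, d_model)
                for i in range(d_model)]
               for p in range(seq_len)])

print(pe.shape)  # (10, 16)

# Every position gets a distinct signature: no two rows are identical.
assert all(not np.allclose(pe[a], pe[b])
           for a in range(seq_len) for b in range(a + 1, seq_len))
```

Each row of this matrix is the vector added to the word embedding at that position, as described in Level 1.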