Self-Attention
Understanding Queries, Keys, Values, and Scaled Dot-Product Attention.
Self-Attention is the mechanism that allows a model to weigh the importance of different words in a sentence relative to a specific word. It allows the model to say, "To understand the word 'it' in this sentence, I need to look closely at the word 'robot'."
Level 1 — The QKV Intuition
Every word is mathematically represented by three vectors:
- Query (Q): what this word is looking for in the other words.
- Key (K): what this word offers, matched against the queries of other words.
- Value (V): the actual content this word contributes once it is attended to.
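In practice, Q, K, and V are not stored separately per word; they are produced by three learned linear projections of the same token embeddings. A minimal sketch (the sizes here, 10 tokens and dimension 64, are illustrative assumptions):

```python
import torch
import torch.nn as nn

d_model = 64
x = torch.randn(1, 10, d_model)  # a batch of 10 token embeddings

# Three learned projections map the same input to Q, K, and V
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

q, k, v = W_q(x), W_k(x), W_v(x)
print(q.shape, k.shape, v.shape)  # each is (1, 10, 64)
```

Because the projection weights are learned, the model decides for itself what "looking for" and "offering" mean in each layer.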
Level 2 — The Scaled Dot-Product Math
The model calculates the "score" by taking the dot product of Q and K, dividing by the square root of the key dimension dₖ (scaling), and applying a Softmax so that each word's attention weights form a probability distribution summing to 1.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ)V
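The √dₖ divisor is not cosmetic: for random dₖ-dimensional vectors, raw dot products have a standard deviation of about √dₖ, which would push Softmax into its saturated region. A quick sketch of this effect (sample sizes are illustrative):

```python
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(1000, d_k)
k = torch.randn(1000, d_k)

raw = (q * k).sum(dim=-1)      # 1000 raw dot products
scaled = raw / d_k ** 0.5      # divide by sqrt(d_k)

# Unscaled scores spread with std ≈ sqrt(64) = 8; scaling brings it back to ≈ 1,
# keeping the Softmax gradients healthy.
print(raw.std().item(), scaled.std().item())
```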
Level 3 — Contextual Richness
By the end of this process, each word's vector is updated to include information from its neighbors. This "contextualized embedding" is what makes Transformers so much more powerful than static embeddings like Word2Vec.
import torch
import torch.nn.functional as F

# Simulated Q, K, V: batch of 1, sequence of 10 tokens, head dimension d_k = 64
q = torch.randn(1, 10, 64)
k = torch.randn(1, 10, 64)
v = torch.randn(1, 10, 64)

# Attention scores: (1, 10, 10), scaled by sqrt(d_k)
scores = torch.matmul(q, k.transpose(-2, -1)) / (64 ** 0.5)
# Each row becomes a probability distribution over the 10 tokens
weights = F.softmax(scores, dim=-1)
# Output: (1, 10, 64) — each token is a weighted mix of the value vectors
output = torch.matmul(weights, v)
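As a sanity check, the manual computation above should agree with PyTorch's built-in `F.scaled_dot_product_attention` (available since PyTorch 2.0), which fuses the same three steps:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 10, 64)
k = torch.randn(1, 10, 64)
v = torch.randn(1, 10, 64)

# Manual version: scores -> softmax -> weighted sum of values
scores = torch.matmul(q, k.transpose(-2, -1)) / (64 ** 0.5)
manual = torch.matmul(F.softmax(scores, dim=-1), v)

# Built-in fused version (PyTorch 2.0+)
builtin = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(manual, builtin, atol=1e-5))
```

The built-in version also accepts an `attn_mask` or `is_causal=True` for decoder-style masking, which the manual version omits for clarity.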