Self-Attention Tutorial

Self-Attention

Understanding Queries, Keys, Values, and Scaled Dot-Product Attention.

Self-Attention is the mechanism that allows a model to weigh the importance of different words in a sentence relative to a specific word. It allows the model to say, "To understand the word 'it' in this sentence, I need to look closely at the word 'robot'."

Level 1 — The QKV Intuition

Inside the attention layer, each word's embedding is multiplied by three learned weight matrices to produce three vectors:

Query (Q): What I'm looking for
Key (K): What I can offer
Value (V): My actual information
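The three roles above come from three separate learned projections of the same input. Here is a minimal sketch in PyTorch (the sizes — 10 tokens, 64 dimensions — are illustrative assumptions, matching the example later in this tutorial):

```python
import torch
import torch.nn as nn

embed_dim = 64
x = torch.randn(1, 10, embed_dim)  # token embeddings: (batch, seq_len, dim)

# Three learned linear maps turn the same embedding into Q, K, and V
w_q = nn.Linear(embed_dim, embed_dim, bias=False)
w_k = nn.Linear(embed_dim, embed_dim, bias=False)
w_v = nn.Linear(embed_dim, embed_dim, bias=False)

q, k, v = w_q(x), w_k(x), w_v(x)  # each has shape (1, 10, 64)
```

During training, the weights of these projections are what the model actually learns: they decide what each word asks for, offers, and contributes.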

Level 2 — The Scaled Dot-Product Math

The model calculates the "score" by taking the dot product of Q and K, dividing by the square root of the key dimension dₖ (scaling), and applying a Softmax to turn the scores into attention weights that sum to 1.

Attention(Q, K, V) = softmax(QKᵀ / √dₖ)V
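Why divide by √dₖ? The dot product of two random dₖ-dimensional vectors with unit-variance components has variance about dₖ, so unscaled scores grow with the dimension and push the softmax into saturation. A quick empirical sketch (using many random vector pairs, an assumption made for illustration):

```python
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(10000, d_k)
k = torch.randn(10000, d_k)

raw = (q * k).sum(dim=-1)    # unscaled dot products: variance ~ d_k
scaled = raw / d_k**0.5      # scaled dot products: variance ~ 1

print(raw.var().item())      # roughly 64
print(scaled.var().item())   # roughly 1
```

Keeping the score variance near 1 keeps the softmax away from its flat, near-one-hot regime, which preserves useful gradients.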

Level 3 — Contextual Richness

By the end of this process, each word's vector is updated to include information from its neighbors. This "contextualized embedding" is what makes Transformers so much more powerful than static embeddings like Word2Vec.

PyTorch Attention Logic (Simulated)
import torch
import torch.nn.functional as F

# Simulated Q, K, V: batch of 1, sequence length 10, key dimension d_k = 64
d_k = 64
q = torch.randn(1, 10, d_k)
k = torch.randn(1, 10, d_k)
v = torch.randn(1, 10, d_k)

# Attention scores: QKᵀ / √d_k, then softmax over the key positions
scores = torch.matmul(q, k.transpose(-2, -1)) / d_k**0.5  # (1, 10, 10)
weights = F.softmax(scores, dim=-1)                       # each row sums to 1
output = torch.matmul(weights, v)                         # (1, 10, 64)
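As a sanity check, each row of the attention weights is a probability distribution, so every output vector is a weighted average of the value vectors. The manual math should also agree with PyTorch's built-in fused kernel, F.scaled_dot_product_attention (available in PyTorch 2.0+, an assumption about your installed version):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 64
q = torch.randn(1, 10, d_k)
k = torch.randn(1, 10, d_k)
v = torch.randn(1, 10, d_k)

scores = torch.matmul(q, k.transpose(-2, -1)) / d_k**0.5
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, v)

# Weights over key positions sum to 1 for every query position
assert torch.allclose(weights.sum(dim=-1), torch.ones(1, 10))

# The built-in implementation matches the manual computation
builtin = F.scaled_dot_product_attention(q, k, v)
assert torch.allclose(output, builtin, atol=1e-5)
```

In practice you would use the built-in version, which fuses the three steps into one memory-efficient kernel; the manual version is for understanding.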