Attention mechanisms – short Q&A

20 questions and answers on attention mechanisms, covering additive and dot-product attention, self-attention, multi-head attention and their central role in modern transformer-based NLP.

1

What is attention in neural networks?

Answer: Attention is a mechanism that lets a model weight different parts of an input sequence according to their relevance to a query, producing a weighted sum (context vector) that focuses computation on important elements.

2

What are queries, keys and values in attention?

Answer: A query represents the item seeking information, keys index memory positions and values contain information; attention scores are computed between queries and keys and used to weight the values.

3

How does dot-product (scaled) attention work?

Answer: Dot-product attention computes scores as dot products between query and key vectors, divides them by the square root of the key dimension (√d_k) to keep the softmax gradients well-behaved, applies softmax, and uses the resulting weights to form a weighted sum of value vectors.
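The steps above can be sketched in a few lines of NumPy; the shapes and random inputs here are purely illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) pairwise scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # context: (n_q, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))
context, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the keys, so `context` is a convex combination of the value vectors.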

4

How does additive (Bahdanau) attention differ from dot-product attention?

Answer: Additive attention uses a small feed-forward network to combine query and key vectors and produce scores, while dot-product attention uses their inner product; the two behave similarly for small key dimensions, but dot-product attention is faster and more memory-efficient in practice because it maps onto highly optimized matrix multiplication.
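As a minimal sketch of the Bahdanau-style scoring function, assuming randomly initialized (untrained) weight matrices `W_q`, `W_k` and vector `v`:

```python
import numpy as np

def additive_scores(query, keys, W_q, W_k, v):
    """Bahdanau-style score: v^T tanh(W_q q + W_k k_i), one score per key."""
    # query: (d_q,), keys: (n, d_k); hidden size d_h is set by W_q / W_k
    hidden = np.tanh(query @ W_q.T + keys @ W_k.T)  # (n, d_h), broadcast over keys
    return hidden @ v                               # (n,) unnormalized scores

rng = np.random.default_rng(1)
d_q = d_k = 4; d_h = 8; n = 3
W_q = rng.normal(size=(d_h, d_q))
W_k = rng.normal(size=(d_h, d_k))
v = rng.normal(size=d_h)
scores = additive_scores(rng.normal(size=d_q), rng.normal(size=(n, d_k)), W_q, W_k, v)
```

The scores would then go through a softmax and weight the values, exactly as in the dot-product case; only the scoring function differs.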

5

What is self-attention?

Answer: Self-attention is attention where queries, keys and values all come from the same sequence, allowing each position to attend to every other position and build contextual representations without recurrence.
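A toy illustration of this, with all three projections applied to the same sequence X (random weights stand in for learned parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Queries, keys and values are all projections of the same sequence X (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every position attends to every other
    return softmax(scores) @ V               # (n, d) contextual representations

rng = np.random.default_rng(2)
n, d = 5, 8
X = rng.normal(size=(n, d))
out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
```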

6

What is multi-head attention and why is it used?

Answer: Multi-head attention runs several attention mechanisms in parallel with different learned projections, enabling the model to capture diverse types of relationships and combine them for richer representations.
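A simplified sketch of the parallel-heads idea, assuming two heads whose outputs are concatenated and mixed by an output projection `W_o` (all weights random here, not learned):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) per-head projections; W_o mixes the heads."""
    outs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v   # each head sees its own subspace
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        outs.append(softmax(scores) @ V)
    return np.concatenate(outs, axis=-1) @ W_o  # concat, then project back to d_model

rng = np.random.default_rng(3)
n, d_model, h = 4, 8, 2
d_head = d_model // h  # heads typically split the model dimension
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(d_model, d_model))
out = multi_head_self_attention(X, heads, W_o)
```

Because each head computes attention over a lower-dimensional projection, the total cost stays comparable to one full-dimension head while allowing the heads to specialize.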

7

How is attention used in encoder–decoder models for machine translation?

Answer: The decoder uses its current state as a query to attend over encoder states (keys/values), producing a context vector that highlights source words relevant to the next target word being generated.

8

What are attention masks and why are they needed?

Answer: Masks set attention scores to −∞ (or a very large negative value) before the softmax for certain positions, such as padding tokens or future tokens in causal language modeling, so the corresponding attention weights become zero and the model cannot attend to invalid or disallowed positions.
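A small illustration of a padding mask, with made-up scores for one query over four key positions where the last position is padding:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# scores for 4 key positions; the last one is a padding token
scores = np.array([[2.0, 1.0, 0.5, 3.0]])
pad_mask = np.array([True, True, True, False])  # False = masked (padding) position
masked = np.where(pad_mask, scores, -np.inf)    # -inf -> softmax weight of exactly 0
weights = softmax(masked)
```

After the softmax, the padding position receives zero weight and the remaining weights renormalize over the valid positions.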

9

How does self-attention compare to RNNs for context modeling?

Answer: Self-attention directly connects every pair of positions with a learned weight, giving a constant-length path between any two tokens, while RNNs must propagate information stepwise through O(n) intermediate states, which makes long-range dependencies harder to learn and computation less parallelizable.

10

What is the computational complexity of full self-attention?

Answer: Full self-attention has O(n²) time and memory complexity with respect to sequence length n, due to computing pairwise scores between all token positions, which can be expensive for very long sequences.
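The quadratic cost can be made concrete by measuring the score matrix alone at two sequence lengths (illustrative sizes only):

```python
import numpy as np

d = 64
sizes = {}
for n in (128, 256):
    Q = K = np.zeros((n, d), dtype=np.float32)
    scores = Q @ K.T          # (n, n): one score per ordered pair of positions
    sizes[n] = scores.nbytes  # memory for the score matrix alone
# doubling n quadruples the score matrix: n^2 entries
```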

11

What are some variants of efficient or sparse attention?

Answer: Variants include local windowed attention, block-sparse patterns, low-rank approximations, Linformer, Longformer, BigBird and other architectures that reduce the quadratic cost for long sequences.
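As one concrete example of these patterns, a local window mask (the Longformer-style sliding-window idea, sketched here without the global tokens those models also use) restricts each position to a fixed-size neighborhood:

```python
import numpy as np

def local_window_mask(n, w):
    """Allow position i to attend only to positions j with |i - j| <= w."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w  # (n, n) boolean mask

mask = local_window_mask(6, 1)
# each row has at most 2*w + 1 allowed positions instead of n,
# so the attention cost scales as O(n * w) rather than O(n^2)
```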

12

How does positional information enter self-attention models?

Answer: Because self-attention is permutation-invariant by itself, models add positional encodings or learned position embeddings to token representations so the network can distinguish and exploit order information.
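The fixed sinusoidal encodings from the original transformer paper can be generated as follows (a small sketch; learned position embeddings would simply be a trainable lookup table instead):

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Sin/cos encodings at geometrically spaced frequencies, shape (n, d)."""
    pos = np.arange(n)[:, None]          # (n, 1) positions
    i = np.arange(0, d, 2)[None, :]      # even embedding dimensions
    angles = pos / (10000 ** (i / d))    # lower dims oscillate faster
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(10, 8)
# added elementwise to token embeddings: X = embeddings + pe
```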

13

What is cross-attention in transformers?

Answer: Cross-attention refers to attention where decoder queries attend over encoder keys and values, enabling the decoder to condition on the encoded source sequence in encoder–decoder transformer architectures like T5 or BART.
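A bare-bones illustration, using raw encoder states directly as keys and values (real models apply learned projections first):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
enc = rng.normal(size=(7, 8))  # encoder states: 7 source positions
dec = rng.normal(size=(3, 8))  # decoder states: 3 target positions
# queries come from the decoder, keys/values from the encoder
scores = dec @ enc.T / np.sqrt(enc.shape[-1])  # (3, 7)
context = softmax(scores) @ enc                # each target position summarizes the source
```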

14

How is attention visualized and interpreted?

Answer: Attention weights can be plotted as heatmaps between tokens, giving insights into which parts of the input the model focuses on, though such visualizations must be interpreted cautiously and are not perfect explanations.

15

Does attention always correspond to human notions of importance?

Answer: Not necessarily; attention weights show where the model routes information but may not align perfectly with intuitive importance, so they are useful diagnostic tools but not definitive explanations of model reasoning.

16

What is self-attention’s role in BERT-like models?

Answer: In BERT, stacked self-attention layers build bidirectional contextual representations, letting each token incorporate information from the entire sequence in both directions for pretraining and downstream tasks.

17

How do attention heads specialize in practice?

Answer: Empirically, different heads may focus on syntactic dependencies, coreference links, positional patterns or other relations, although specialization is emergent and varies across layers and tasks.

18

What is causal (masked) self-attention?

Answer: Causal self-attention restricts each position to attend only to previous positions (and itself), enforcing an autoregressive factorization for language modeling and text generation, as in GPT-style models.
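The causal restriction is implemented with a lower-triangular mask over the score matrix; a minimal sketch with random scores:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.random.default_rng(4).normal(size=(n, n))
causal = np.tril(np.ones((n, n), dtype=bool))          # allow j <= i only
weights = softmax(np.where(causal, scores, -np.inf))   # -inf zeroes future positions
# row i places zero weight on every future position j > i;
# row 0 can only attend to itself
```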

19

Why is attention called “all you need” in the transformer paper?

Answer: The transformer paper showed that stacked attention and feed-forward layers, without recurrence or convolution, could achieve state-of-the-art performance on translation, suggesting attention alone suffices for many sequence tasks.

20

Why is understanding attention critical for modern NLP engineers?

Answer: Attention is the core building block of transformers and large language models; understanding its math and behavior is essential for debugging, adapting and innovating on today’s dominant NLP architectures.
