Transformers MCQ · test your attention knowledge
From self‑attention to BERT & GPT – 15 questions covering multi‑head attention, positional encoding, masking, and core transformer concepts.
Transformers: the foundation of modern NLP
The Transformer architecture, introduced in "Attention Is All You Need", replaces recurrence with self‑attention. This MCQ tests your understanding of attention mechanisms, positional encoding, and the building blocks of models like BERT and GPT.
Why self‑attention?
Self‑attention allows each token to attend to all other tokens in the sequence, capturing long‑range dependencies and enabling parallelization – a major improvement over RNNs/LSTMs.
Transformer glossary – key concepts
Self‑Attention (Scaled Dot‑Product)
Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V. Computes relevance between all pairs of positions.
Multi‑Head Attention
Runs multiple attention heads in parallel, allowing the model to focus on different subspaces.
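The head-splitting idea can be sketched in a few lines of NumPy. This is illustrative only: the learned projections W_Q, W_K, W_V and the final output projection of real multi-head attention are omitted, and the names `multi_head_attention` and `num_heads` are ours, not from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    """Split d_model into num_heads subspaces, attend in each, concatenate."""
    seq_len, d_model = Q.shape
    d_k = d_model // num_heads        # per-head dimension (assumes even split)
    # (seq_len, d_model) -> (num_heads, seq_len, d_k)
    split = lambda X: X.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq, seq)
    weights = softmax(scores)                            # per-head attention
    out = weights @ Vh                                   # (heads, seq, d_k)
    # concatenate heads back: (heads, seq, d_k) -> (seq, d_model)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)
```

Each head attends over the full sequence but within its own d_k-dimensional slice, which is what lets different heads specialize in different subspaces.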
Positional Encoding
Adds information about token order (since self‑attention is permutation‑invariant). Usually sine/cosine or learned.
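The sine/cosine variant can be sketched as follows (a minimal NumPy version; the function name is ours, and `d_model` is assumed even):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings in the style of "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) token positions
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2) dimension pairs
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe
```

Each dimension pair oscillates at a different wavelength, giving every position a distinct pattern that the model can use to recover order.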
Encoder‑Decoder
Encoder maps input to representations; decoder generates output autoregressively (with cross‑attention).
Masked Attention
In decoders, prevents attending to future tokens (causal masking). Note that BERT's masked language modeling is a different mechanism: it masks input tokens (e.g. replacing them with [MASK]) rather than masking attention scores.
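A causal mask is just a lower-triangular matrix. Under the common convention that 0 marks blocked positions, a minimal sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=int))

# causal_mask(3):
# [[1 0 0]
#  [1 1 0]
#  [1 1 1]]
```

Positions where the mask is 0 are set to a large negative value before the softmax, so they receive (effectively) zero attention weight.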
BERT (Encoder‑only)
Bidirectional representations, pretrained with masked LM and next sentence prediction.
GPT (Decoder‑only)
Autoregressive language model, trained to predict next token. Uses causal masking.
# Scaled dot‑product attention (simplified, PyTorch)
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    d_k = Q.size(-1)  # key/query dimension
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:  # positions where mask == 0 are blocked
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V
Common Transformer interview questions
- Why is self‑attention considered more parallelizable than RNNs?
- Explain the role of the scaling factor √dₖ in attention.
- What is the purpose of multi‑head attention?
- How does positional encoding work? Why sine/cosine functions?
- Compare encoder‑only (BERT) vs decoder‑only (GPT) architectures.
- What is masked attention and where is it used?
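As a quick numerical check on the √dₖ question above (a hedged NumPy experiment, not from the source): dot products of independent unit-variance vectors have variance roughly dₖ, so dividing by √dₖ keeps pre-softmax scores at unit scale and prevents the softmax from saturating.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
# Dot products of independent standard-normal vectors grow with d_k.
dots = np.array([rng.standard_normal(d_k) @ rng.standard_normal(d_k)
                 for _ in range(2000)])
print(dots.var())                    # roughly d_k (~512)
print((dots / np.sqrt(d_k)).var())   # roughly 1 after scaling
```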