Transformers Q&A

Transformer architectures – short Q&A

20 questions and answers on transformer models, including self-attention, encoder–decoder structure, pretraining objectives like MLM and autoregressive LM, and how transformers power modern NLP.

1. What is the key idea behind the transformer architecture?

Answer: Transformers replace recurrence and convolution with stacked self-attention and feed-forward layers, enabling efficient parallel computation and direct modeling of relationships between all tokens in a sequence.
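The core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy sketch with toy shapes (illustrative only, not from the text):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed for a single sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # weighted sum of values

# Toy example: 3 tokens, d_model = 4
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): every token mixes information from every token
```

Because the score matrix is (n, n), every token can attend directly to every other token in one step, which is what enables the parallelism mentioned above.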

2. What are the main components of a transformer encoder layer?

Answer: An encoder layer consists of multi-head self-attention followed by a position-wise feed-forward network, each wrapped with residual connections and layer normalization for stable deep training.
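That sublayer structure can be sketched in a few lines; the attention and feed-forward callables below are toy stand-ins for the learned sublayers, and this uses the post-norm arrangement of the original paper:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def encoder_layer(x, self_attn, ffn):
    # Sublayer 1: self-attention, then residual add + layer norm
    x = layer_norm(x + self_attn(x))
    # Sublayer 2: position-wise feed-forward, then residual add + layer norm
    return layer_norm(x + ffn(x))

# Stand-in sublayers (real ones are learned projections):
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))              # 5 tokens, d_model = 8
attn = lambda h: np.ones_like(h) * h.mean(axis=0, keepdims=True)  # toy mixing
ffn = lambda h: np.maximum(h, 0)         # toy ReLU feed-forward
y = encoder_layer(x, attn, ffn)
print(y.shape)  # (5, 8): shape is preserved, so layers stack cleanly
```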

3. How does a transformer decoder differ from the encoder?

Answer: The decoder adds a masked self-attention block to enforce autoregressive generation and a cross-attention block that lets it attend over encoder outputs, in addition to feed-forward layers and residual connections.
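The causal mask can be made concrete: positions above the diagonal are set to -inf before the softmax, so each token receives zero weight on future tokens. A small NumPy sketch (uniform scores, for illustration):

```python
import numpy as np

n = 4
# Token i may attend only to positions j <= i (lower-triangular mask)
mask = np.tril(np.ones((n, n), dtype=bool))
scores = np.zeros((n, n))
scores[~mask] = -np.inf        # future positions get -inf ...
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # ... so softmax weight is 0
print(weights.round(2))
# Row 0 attends only to token 0; row 3 attends uniformly to tokens 0-3
```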

4. Why do transformers need positional encodings?

Answer: Self-attention alone is permutation-invariant, so positional encodings or learned position embeddings inject order information, allowing the model to reason about token positions and relative distances.
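The original transformer used fixed sinusoidal encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)). A sketch of that scheme (toy sizes):

```python
import numpy as np

def sinusoidal_positions(n_pos, d_model):
    """Fixed sine/cosine position encodings from the original transformer."""
    pos = np.arange(n_pos)[:, None]                 # (n_pos, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))     # geometric frequencies
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dims: sine
    pe[:, 1::2] = np.cos(angles)                    # odd dims: cosine
    return pe

pe = sinusoidal_positions(50, 16)
print(pe.shape)  # (50, 16): one d_model-sized vector added to each token
```

These vectors are simply added to the token embeddings, giving otherwise order-blind attention a signal for absolute and relative position.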

5. What is multi-head self-attention?

Answer: Multi-head self-attention runs several attention operations in parallel on different learned projections of the inputs, capturing multiple types of relationships and combining them for richer token representations.
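Mechanically, the "heads" split the model dimension into independent subspaces. A minimal sketch of the split-and-concatenate bookkeeping (the per-head learned projections and attention itself are omitted):

```python
import numpy as np

n, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads                     # 4 dims per head
x = np.random.default_rng(1).normal(size=(n, d_model))

# Split the model dimension into n_heads independent subspaces
heads = x.reshape(n, n_heads, d_head).transpose(1, 0, 2)  # (heads, n, d_head)
# ... each head would run its own attention here ...
# then head outputs are concatenated back along the feature dimension:
merged = heads.transpose(1, 0, 2).reshape(n, d_model)
assert np.allclose(merged, x)  # split + concat round-trips exactly
```

Since each head operates on a d_model/n_heads slice, multi-head attention costs about the same as one full-width head while letting different heads specialize.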

6. How does BERT use the transformer architecture?

Answer: BERT is a stack of transformer encoder layers trained with masked language modeling and next sentence prediction, producing contextual embeddings that can be fine-tuned for many downstream NLP tasks.
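A toy illustration of the MLM masking step. The 15% rate and the `[MASK]` token follow BERT; the real procedure additionally replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(42)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
# Select ~15% of tokens and hide them; the model is then trained to
# predict the originals from bidirectional context.
masked = [("[MASK]" if rng.random() < 0.15 else t) for t in tokens]
print(masked)
```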

7. How do GPT-style models differ from BERT?

Answer: GPT models use only transformer decoder blocks with causal self-attention and are trained on autoregressive language modeling, focusing on left-to-right text generation rather than bidirectional encoding.

8. What are common pretraining objectives for transformers?

Answer: Objectives include masked language modeling (MLM), autoregressive LM, next sentence prediction, permutation language modeling, replaced token detection and span corruption, depending on the specific model design.

9. Why are transformers effective for transfer learning in NLP?

Answer: Large transformers pretrained on massive corpora learn general language representations that can be adapted with relatively little labeled data, providing strong performance across many downstream tasks via fine-tuning or prompting.

10. What are some drawbacks of transformers?

Answer: Transformers require substantial compute and memory, especially for long sequences due to O(n²) self-attention, and large pretrained models can be hard to deploy, interpret and train responsibly without careful engineering.
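The quadratic cost is easy to make concrete: one fp32 attention matrix alone needs n² × 4 bytes per head per layer.

```python
# Back-of-envelope memory for one fp32 (n, n) attention matrix:
for n in (512, 4096, 32768):
    mib = n * n * 4 / 2**20
    print(f"n={n}: {mib:.0f} MiB per head per layer")
# n=512: 1 MiB; n=4096: 64 MiB; n=32768: 4096 MiB (4 GiB)
```

An 8x increase in sequence length thus multiplies this cost by 64, which is why long-context variants (Q13) attack the attention pattern itself.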

11. What are encoder-only, decoder-only and encoder–decoder transformers used for?

Answer: Encoder-only models (BERT) excel at understanding tasks, decoder-only models (GPT) at generation, and encoder–decoder models (T5, BART) at sequence-to-sequence tasks like translation and summarization.

12. What is layer normalization and why is it used in transformers?

Answer: Layer normalization rescales and recenters activations within each layer, stabilizing training in deep networks; transformers typically apply it around attention and feed-forward sublayers along with residual connections.
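A minimal sketch of the computation: each token's feature vector is normalized to zero mean and unit variance, then rescaled by learnable parameters (shown here as scalar defaults):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each token's features, then apply learnable gain/bias."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x)
print(y.mean(), y.var())  # ≈ 0 and ≈ 1 for each token's feature vector
```

Unlike batch normalization, the statistics are computed per token over the feature dimension, so they do not depend on batch size or sequence length.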

13. How are transformers adapted for long-document processing?

Answer: Long-document variants use sparse attention patterns, segment-level recurrence, hierarchical encoders or memory tokens to extend context length while controlling the quadratic cost of standard self-attention.

14. What is fine-tuning versus prompting for transformers?

Answer: Fine-tuning updates model weights on task data, while prompting keeps the model fixed and designs textual prompts or lightweight adapters to elicit desired behaviors from a general-purpose pretrained model.

15. What role do transformers play in large language models (LLMs)?

Answer: Most modern LLMs are large transformer-based architectures scaled up in depth, width and training data, using variants of decoder-only or encoder–decoder designs to perform a wide range of language tasks.

16. How does attention help transformers capture syntax and semantics?

Answer: Attention heads learn to focus on syntactic dependencies, coreference links and semantic relations between tokens, enabling layers of self-attention to encode rich structural and semantic information implicitly.

17. What are some common transformer variants for efficiency?

Answer: Variants like DistilBERT, ALBERT, Linformer, Performer and Longformer reduce parameter counts or attention complexity, making transformers more practical for deployment or long-context tasks.

18. What risks come with deploying large transformer models?

Answer: Risks include biased or toxic outputs, hallucinated facts, privacy leakage, high energy consumption and the need for strong safeguards, monitoring and alignment when using transformers in real-world systems.

19. How are transformers applied beyond text in multimodal settings?

Answer: Transformers process sequences of image patches, audio frames or mixed text–vision tokens, enabling vision transformers, multimodal LLMs and cross-modal retrieval systems with unified attention-based architectures.

20. Why should NLP practitioners deeply understand transformers today?

Answer: Transformers underpin most state-of-the-art NLP systems, so understanding their architecture, training and limitations is crucial for model selection, debugging, optimization and responsible deployment.
