Transformer architectures – short Q&A
20 questions and answers on transformer models, including self-attention, encoder–decoder structure, pretraining objectives like MLM and autoregressive LM, and how transformers power modern NLP.
What is the key idea behind the transformer architecture?
Answer: Transformers replace recurrence and convolution with stacked self-attention and feed-forward layers, enabling efficient parallel computation and direct modeling of relationships between all tokens in a sequence.
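To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation the answer describes. It is an illustrative toy, not a library implementation; the function name and shapes are chosen for clarity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) pairwise token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key axis
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 4, 8                                         # 4 tokens, model dim 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): every token attends to every other token in one step
```

Because the score matrix is computed for all token pairs at once, the whole sequence is processed in parallel, with no sequential recurrence.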
What are the main components of a transformer encoder layer?
Answer: An encoder layer consists of multi-head self-attention followed by a position-wise feed-forward network, each wrapped with residual connections and layer normalization for stable deep training.
How does a transformer decoder differ from the encoder?
Answer: The decoder adds a masked self-attention block to enforce autoregressive generation and a cross-attention block that lets it attend over encoder outputs, in addition to feed-forward layers and residual connections.
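The masking that enforces autoregressive generation can be sketched as a lower-triangular mask applied to the attention scores before the softmax (toy NumPy example; values are random and purely illustrative):

```python
import numpy as np

n = 5
# Lower-triangular mask: position i may attend only to positions <= i.
mask = np.tril(np.ones((n, n), dtype=bool))
scores = np.random.default_rng(1).standard_normal((n, n))
scores = np.where(mask, scores, -np.inf)        # block attention to future tokens
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0])  # first token can only attend to itself -> [1, 0, 0, 0, 0]
```

Setting future positions to negative infinity makes their softmax weight exactly zero, so each position's representation depends only on earlier tokens.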
Why do transformers need positional encodings?
Answer: Self-attention alone is permutation-invariant, so positional encodings or learned position embeddings inject order information, allowing the model to reason about token positions and relative distances.
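The original sinusoidal scheme from "Attention Is All You Need" can be sketched as follows (a minimal NumPy version; the helper name is illustrative):

```python
import numpy as np

def sinusoidal_positions(n_pos, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positions(50, 16)
print(pe.shape)  # (50, 16); added to token embeddings before the first layer
```

Each position gets a unique pattern across frequencies, and relative offsets correspond to fixed linear transformations, which is one reason this scheme generalizes across positions.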
What is multi-head self-attention?
Answer: Multi-head self-attention runs several attention operations in parallel on different learned projections of the inputs, capturing multiple types of relationships and combining them for richer token representations.
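A compact NumPy sketch of the split-attend-concatenate pattern (weight matrices are random stand-ins; a real layer would learn them):

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Project inputs, split into heads, attend per head, concat, project out."""
    n, d = X.shape
    d_h = d // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Reshape (n, d) -> (n_heads, n, d_h): each head works in its own subspace.
    split = lambda M: M.reshape(n, n_heads, d_h).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_h)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    heads = w @ Vh                                   # (n_heads, n, d_h)
    concat = heads.transpose(1, 0, 2).reshape(n, d)  # re-merge the heads
    return concat @ W_o

rng = np.random.default_rng(2)
n, d, h = 6, 16, 4
X = rng.standard_normal((n, d))
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads=h)
print(out.shape)  # (6, 16)
```

Each head attends over a lower-dimensional projection, so the total cost is comparable to single-head attention at full dimension while capturing several relation types at once.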
How does BERT use the transformer architecture?
Answer: BERT is a stack of transformer encoder layers trained with masked language modeling and next sentence prediction, producing contextual embeddings that can be fine-tuned for many downstream NLP tasks.
How do GPT-style models differ from BERT?
Answer: GPT models use only transformer decoder blocks with causal self-attention and are trained on autoregressive language modeling, focusing on left-to-right text generation rather than bidirectional encoding.
What are common pretraining objectives for transformers?
Answer: Objectives include masked language modeling (MLM), autoregressive LM, next sentence prediction, permutation language modeling, replaced token detection, and span corruption, depending on the specific model design.
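The MLM objective can be sketched as a simple masking step over a token sequence. Note this is a simplification: BERT's full recipe also replaces some selected tokens with random tokens or leaves them unchanged, while this toy version only substitutes [MASK].

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", p=0.15, seed=1):
    """Randomly replace ~15% of tokens with [MASK]; the model predicts them."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok       # the original token becomes the label
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("the cat sat on the mat".split())
print(masked, targets)
```

Training then minimizes cross-entropy between the model's predictions at the masked positions and the original tokens in `targets`.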
Why are transformers effective for transfer learning in NLP?
Answer: Large transformers pretrained on massive corpora learn general language representations that can be adapted with relatively little labeled data, providing strong performance across many downstream tasks via fine-tuning or prompting.
What are some drawbacks of transformers?
Answer: Transformers require substantial compute and memory, especially for long sequences due to O(n²) self-attention, and large pretrained models can be hard to deploy, interpret and train responsibly without careful engineering.
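A back-of-the-envelope calculation makes the O(n²) cost tangible. The head count and float size below are illustrative assumptions, not figures for any specific model:

```python
# Attention materializes an n x n score matrix per head, so memory for the
# attention maps alone grows quadratically with sequence length.
def attn_matrix_bytes(n_tokens, n_heads=16, bytes_per_float=4):
    return n_tokens ** 2 * n_heads * bytes_per_float

for n in (512, 2048, 8192):
    print(n, attn_matrix_bytes(n) / 2**20, "MiB per layer")
# Going 512 -> 2048 (4x tokens) costs 16x the memory: 16 MiB -> 256 MiB.
```

Multiply by the number of layers and add activations and weights, and long contexts quickly become the dominant memory cost, which motivates the efficiency variants discussed below.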
What are encoder-only, decoder-only and encoder–decoder transformers used for?
Answer: Encoder-only models (e.g., BERT) excel at understanding tasks, decoder-only models (e.g., GPT) at generation, and encoder–decoder models (e.g., T5, BART) at sequence-to-sequence tasks like translation and summarization.
What is layer normalization and why is it used in transformers?
Answer: Layer normalization rescales and recenters activations within each layer, stabilizing training in deep networks; transformers typically apply it around attention and feed-forward sublayers along with residual connections.
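A minimal sketch of the normalization itself (with the learnable scale `gamma` and shift `beta` set to their identity values here):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's features to zero mean / unit variance, then
    rescale and shift with learned parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(), y.std())  # ~0 and ~1 for each token
```

Unlike batch normalization, the statistics are computed per token across the feature dimension, so the operation behaves identically at any batch size and sequence length.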
How are transformers adapted for long-document processing?
Answer: Long-document variants use sparse attention patterns, segment-level recurrence, hierarchical encoders or memory tokens to extend context length while controlling the quadratic cost of standard self-attention.
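One common sparse pattern, the sliding-window (local) attention used by models like Longformer, can be sketched as a band mask (toy NumPy example):

```python
import numpy as np

def sliding_window_mask(n, window):
    """Token i attends only to tokens within `window` positions of i, so the
    number of attended pairs grows as O(n * window) instead of O(n^2)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(8, window=2)
print(mask.sum())  # nonzero entries grow linearly in n, not quadratically
```

Models typically combine such local windows with a few global tokens that attend everywhere, preserving long-range information flow at near-linear cost.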
What is fine-tuning versus prompting for transformers?
Answer: Fine-tuning updates model weights on task data, while prompting keeps the weights fixed and crafts textual instructions or in-context examples to elicit desired behaviors; parameter-efficient methods like adapters sit in between, training small added modules while freezing the base model.
What role do transformers play in large language models (LLMs)?
Answer: Most modern LLMs are large transformer-based architectures scaled up in depth, width and training data, using variants of decoder-only or encoder–decoder designs to perform a wide range of language tasks.
How does attention help transformers capture syntax and semantics?
Answer: Attention heads learn to focus on syntactic dependencies, coreference links and semantic relations between tokens, enabling layers of self-attention to encode rich structural and semantic information implicitly.
What are some common transformer variants for efficiency?
Answer: Variants like DistilBERT, ALBERT, Linformer, Performer and Longformer reduce parameter counts or attention complexity, making transformers more practical for deployment or long-context tasks.
What risks come with deploying large transformer models?
Answer: Risks include biased or toxic outputs, hallucinated facts, privacy leakage, high energy consumption and the need for strong safeguards, monitoring and alignment when using transformers in real-world systems.
How are transformers applied beyond text in multimodal settings?
Answer: Transformers process sequences of image patches, audio frames or mixed text–vision tokens, enabling vision transformers, multimodal LLMs and cross-modal retrieval systems with unified attention-based architectures.
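The ViT-style conversion of an image into a token sequence can be sketched as a patch-extraction step (toy NumPy example; a real model would then linearly project each flattened patch):

```python
import numpy as np

def patchify(image, patch=4):
    """Split an HxWxC image into a sequence of flattened non-overlapping
    patches, each of which becomes one 'token' for the transformer."""
    H, W, C = image.shape
    rows = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = rows.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return patches

img = np.zeros((32, 32, 3))
tokens = patchify(img)
print(tokens.shape)  # (64, 48): an 8x8 grid of patches, each 4*4*3 values
```

Once images, audio frames, and text are all expressed as token sequences, the same attention machinery handles every modality, which is what enables unified multimodal architectures.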
Why should NLP practitioners deeply understand transformers today?
Answer: Transformers underpin most state-of-the-art NLP systems, so understanding their architecture, training and limitations is crucial for model selection, debugging, optimization and responsible deployment.