Sequence-to-sequence models – short Q&A
20 questions and answers on encoder–decoder seq2seq architectures, attention mechanisms, training strategies and their use in machine translation and summarization.
What is a sequence-to-sequence (seq2seq) model?
Answer: A seq2seq model maps an input sequence to an output sequence of possibly different length using an encoder that processes the input and a decoder that generates the output step by step.
Which NLP tasks commonly use seq2seq architectures?
Answer: Machine translation, abstractive summarization, dialogue generation, data-to-text generation and grammatical error correction are all classic applications of seq2seq models in NLP.
How does the encoder in a seq2seq model work?
Answer: The encoder reads the input sequence and produces a sequence of hidden states or a final context vector that summarizes the input, which is then used to initialize or condition the decoder.
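The loop structure can be sketched in a few lines of pure Python. This is a toy recurrence (elementwise tanh, no learned weight matrices), not a real GRU/LSTM; the point is only that the encoder emits one hidden state per token plus a final summary state:

```python
import math

def rnn_step(x, h):
    # Toy recurrence: elementwise tanh(x + h). A real encoder would apply
    # learned GRU/LSTM weight matrices here instead.
    return [math.tanh(xi + hi) for xi, hi in zip(x, h)]

def encode(embedded_tokens, hidden_size=2):
    """Run the encoder left to right, collecting one hidden state per token."""
    h = [0.0] * hidden_size
    states = []
    for x in embedded_tokens:
        h = rnn_step(x, h)
        states.append(h)
    # All per-token states can feed attention; the final state can
    # initialize the decoder.
    return states, h

embedded = [[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]]  # 3 toy token embeddings
states, final = encode(embedded)
```

Without attention, only `final` reaches the decoder, which is exactly the bottleneck discussed in the next question.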
What problem arises when using a fixed-size context vector?
Answer: A single fixed-size vector can become a bottleneck for long or information-rich inputs, making it hard for the decoder to access all necessary details, which motivated attention mechanisms in seq2seq models.
How does attention improve seq2seq models?
Answer: Attention lets the decoder compute a weighted combination of encoder states at each step, dynamically focusing on relevant parts of the input instead of relying on a single context vector.
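A minimal pure-Python sketch of dot-product attention, using toy hand-picked vectors rather than learned representations:

```python
import math

def dot_product_attention(dec_state, enc_states):
    """Score each encoder state against the decoder state, softmax the
    scores into weights, and return the weighted sum (the context vector)."""
    scores = [sum(d * e for d, e in zip(dec_state, h)) for h in enc_states]
    m = max(scores)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(enc_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, enc_states))
               for i in range(dim)]
    return weights, context

# Toy values: three 2-dimensional encoder states, one decoder state.
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_state = [1.0, 0.0]
weights, context = dot_product_attention(dec_state, enc_states)
```

The decoder state aligns with the first and third encoder states, so they receive equal, higher weights and dominate the context vector; at the next step the weights are recomputed from scratch.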
What is teacher forcing in seq2seq training?
Answer: Teacher forcing feeds the ground-truth previous token to the decoder at each time step during training, rather than its own prediction, which speeds up and stabilizes learning but introduces exposure bias.
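The key point, that the decoder's *inputs* come from the gold sequence rather than its own outputs, can be shown with a stub decoder (the lambda standing in for an untrained model is purely illustrative):

```python
def train_step_teacher_forced(decoder_step, target_tokens, bos="<s>"):
    """One training pass: the decoder always receives the *gold* previous
    token as input, regardless of what it predicted."""
    inputs, predictions = [], []
    prev, state = bos, None
    for gold in target_tokens:
        pred, state = decoder_step(prev, state)
        inputs.append(prev)
        predictions.append(pred)
        prev = gold                 # teacher forcing: feed the ground truth
    return inputs, predictions

# Stub decoder that always predicts "x" (stands in for an untrained model).
step = lambda tok, state: ("x", state)
inputs, preds = train_step_teacher_forced(step, ["the", "cat", "sat"])
# inputs are the gold tokens, not the model's own "x" predictions.
```

At inference time `prev = pred` would be used instead, which is precisely the train/test mismatch called exposure bias.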
What decoding strategies are used with seq2seq models?
Answer: Greedy decoding picks the highest-probability token at each step, while beam search maintains multiple candidate sequences to better approximate the most likely overall output sequence.
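A compact beam-search sketch over a hypothetical toy model, constructed so that greedy decoding is suboptimal (the locally best first token "a" leads to a worse full sequence than "b"):

```python
import heapq
import math

def beam_search(next_probs, beam_size, max_len=3, eos="</s>"):
    """Minimal beam search. `next_probs(prefix)` returns a dict mapping
    each candidate next token to its probability given the prefix."""
    beams = [(0.0, [])]                       # (log-probability, sequence)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq and seq[-1] == eos:        # finished hypotheses carry over
                candidates.append((logp, seq))
                continue
            for tok, p in next_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [tok]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beams[0][1]

def toy_model(prefix):
    # Hypothetical distributions: "a" wins step one (0.6 vs 0.4), but the
    # full sequence "b z" (0.4 * 0.9 = 0.36) beats "a x" (0.6 * 0.5 = 0.30).
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix == ["a"]:
        return {"x": 0.5, "y": 0.5}
    if prefix == ["b"]:
        return {"z": 0.9, "w": 0.1}
    return {"</s>": 1.0}

greedy = beam_search(toy_model, beam_size=1)  # commits to "a" and is stuck
beam = beam_search(toy_model, beam_size=2)    # keeps "b" alive, finds "b z"
```

With `beam_size=1` this reduces exactly to greedy decoding, which is why beam search is usually described as its generalization.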
How do transformer-based seq2seq models differ from RNN-based ones?
Answer: Transformer seq2seq models replace recurrence with self-attention in both encoder and decoder, enabling greater parallelism and better modeling of long-range dependencies than RNN-based seq2seq architectures.
What is scheduled sampling and why is it used?
Answer: Scheduled sampling gradually replaces ground-truth tokens with model predictions during training, bridging the gap between training and inference and reducing exposure bias in seq2seq models.
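The "gradually" part is a decay schedule on the probability of feeding the gold token. A sketch using the inverse-sigmoid decay proposed in the scheduled-sampling paper (the constant `k` controls how fast teacher forcing is phased out):

```python
import math
import random

def gold_prob(step, k=10.0):
    """Inverse-sigmoid decay: near 1 early in training (mostly teacher
    forcing), decaying toward 0 (mostly the model's own predictions)."""
    return k / (k + math.exp(step / k))

def choose_decoder_input(gold_token, model_token, step, k=10.0):
    # Per time step, flip a biased coin to pick the next decoder input.
    return gold_token if random.random() < gold_prob(step, k) else model_token
```

Linear and exponential decay schedules are also common; the mechanism is the same, only `gold_prob` changes.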
Why is length control important in seq2seq generation?
Answer: Some tasks, like summarization, require outputs of a specific length or compression ratio; controlling length via special tokens, penalties or training objectives prevents overly short or overly long outputs.
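One widely used penalty is the GNMT-style length normalization from Wu et al. (2016): beam-search scores are divided by a penalty that grows with output length, so longer hypotheses are not unfairly punished for accumulating more negative log-probability. A sketch with illustrative numbers:

```python
def length_penalty(length, alpha=0.6):
    """GNMT-style length penalty; alpha in [0, 1] tunes its strength
    (alpha = 0 disables normalization entirely)."""
    return ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)

def rescored(logprob, length, alpha=0.6):
    # Divide the raw log-probability by the penalty before comparing beams.
    return logprob / length_penalty(length, alpha)

# Raw log-probabilities favor the short hypothesis (-4 > -5), but after
# normalization the 20-token hypothesis outranks the 4-token one.
short_raw, long_raw = -4.0, -5.0
short_norm = rescored(short_raw, length=4)
long_norm = rescored(long_raw, length=20)
```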
What is exposure bias in the context of seq2seq models?
Answer: Exposure bias arises because the model is trained with teacher forcing on gold prefixes but must condition on its own previous predictions at test time, so early mistakes can compound over long output sequences.
How are seq2seq models evaluated?
Answer: Evaluation uses task-specific metrics: BLEU or COMET for translation, ROUGE for summarization, and sometimes human judgments of fluency and adequacy, depending on the target application.
What role does subword tokenization play in seq2seq models?
Answer: Subword methods like BPE or SentencePiece allow seq2seq models to handle rare and unknown words by composing them from subword units, improving robustness and vocabulary efficiency in generation tasks.
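The core of BPE training is a simple loop: count symbol pairs across the corpus vocabulary, merge the most frequent pair into a new symbol, and repeat. A toy sketch of one merge step (real implementations like SentencePiece add many practical details):

```python
from collections import Counter

def most_frequent_pair(vocab):
    """vocab maps a word (as a tuple of symbols) to its corpus frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus counts: ("l", "o") occurs 10 times, more than any other pair.
vocab = {("l", "o", "w"): 5, ("s", "l", "o", "w"): 3, ("l", "o", "g"): 2}
pair = most_frequent_pair(vocab)
vocab = merge_pair(vocab, pair)   # "lo" is now a single subword symbol
```

Repeating this loop for a fixed number of merges yields the subword vocabulary; unseen words at inference time are segmented by replaying the learned merges.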
How do copy or pointer mechanisms extend seq2seq models?
Answer: Copy mechanisms blend a standard vocabulary distribution with a pointer over source tokens, allowing seq2seq models to reproduce rare entities and numbers directly from the input while still generating free-form text.
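The blending step is a single interpolation: the generation probability p_gen weights the vocabulary distribution, and (1 - p_gen) weights the attention mass over source tokens. A sketch with toy numbers, where the out-of-vocabulary name "Smith" can only be produced by copying:

```python
def final_distribution(p_gen, vocab_probs, attn_weights, source_tokens):
    """Pointer-generator mix:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on w."""
    final = {w: p_gen * p for w, p in vocab_probs.items()}
    for tok, a in zip(source_tokens, attn_weights):
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * a
    return final

# "Smith" is absent from the toy vocabulary but present in the source.
vocab_probs = {"the": 0.5, "said": 0.3, "<unk>": 0.2}
source = ["Smith", "said", "hello"]
attn = [0.7, 0.2, 0.1]
dist = final_distribution(0.4, vocab_probs, attn, source)
```

Because both input distributions sum to one, the mixture does too, and the copy path lets "Smith" become the most probable output token despite having zero vocabulary probability.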
What is the difference between encoder–decoder and language-model style generation?
Answer: Encoder–decoder models condition on a separate input sequence, while purely decoder-based language models treat tasks as text completion problems by encoding both input and output in a single prompt sequence.
Why did seq2seq with attention replace older phrase-based MT systems?
Answer: Seq2seq models with attention learn translation patterns end-to-end, handle long-range reordering and context better, and avoid the separate phrase tables and hand-engineered features that statistical machine translation (SMT) systems required.
How do we incorporate coverage mechanisms in seq2seq models?
Answer: Coverage tracks how much cumulative attention each source token has received across decoding steps, reducing over-translation (repeatedly attending to the same tokens) and under-translation (never attending to some tokens at all).
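A sketch of the coverage bookkeeping in the style of See et al.'s pointer-generator work: the coverage vector is the running sum of past attention, and the per-step penalty sum_i min(attention_i, coverage_i) is added to the training loss to discourage re-attending:

```python
def coverage_step(coverage, attn_weights):
    """Compute this step's coverage penalty, then fold the new attention
    weights into the running coverage vector."""
    penalty = sum(min(a, c) for a, c in zip(attn_weights, coverage))
    new_coverage = [c + a for c, a in zip(coverage, attn_weights)]
    return new_coverage, penalty

coverage = [0.0, 0.0, 0.0]              # one slot per source token
coverage, p1 = coverage_step(coverage, [0.9, 0.05, 0.05])  # first look: free
coverage, p2 = coverage_step(coverage, [0.8, 0.1, 0.1])    # re-attending costs
```

The first step incurs no penalty because coverage is still zero everywhere; the second step is penalized mainly for attending again to the already-covered first token.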
What are some drawbacks of seq2seq models compared to modern large language models?
Answer: Traditional seq2seq models are usually trained for a single task and domain, lack large-scale pretraining and often require more task-specific engineering than versatile, pre-trained language models used today.
Why is understanding seq2seq still valuable?
Answer: Seq2seq provides the conceptual foundation for many modern architectures, including transformer encoder–decoders, and clarifies how conditioning on input sequences works in generative NLP models.
Where might you still deploy classic seq2seq models?
Answer: Seq2seq models remain useful in constrained environments, specialized applications with limited data or when lightweight, task-specific models are preferable to large pre-trained transformers.
🔍 Seq2seq concepts covered
This page covers seq2seq models: encoder–decoder architectures, attention and copy mechanisms, teacher forcing and scheduled sampling, decoding strategies and how seq2seq underpins modern transformer-based generation.