RNN Q&A

Recurrent neural networks for NLP – short Q&A

20 questions and answers on RNN-based sequence modeling, including vanilla RNNs, vanishing gradients, LSTMs, GRUs and their historical role in NLP tasks.

1. What is a recurrent neural network (RNN)?

Answer: An RNN is a neural architecture that processes a sequence one step at a time, applying the same weights at every position and maintaining a hidden state that is updated at each time step to summarize the elements seen so far.
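
The recurrence can be sketched in plain Python. This is a toy example with hand-picked 2-d weights (no particular library's API), implementing the standard update h_t = tanh(W_xh x + W_hh h_prev + b):

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One vanilla RNN step: h_t = tanh(W_xh @ x + W_hh @ h_prev + b).

    Vectors are plain lists; matrices are lists of rows.
    """
    def matvec(W, v):
        return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]
    pre = [a + c + d for a, c, d in zip(matvec(W_xh, x), matvec(W_hh, h_prev), b)]
    return [math.tanh(p) for p in pre]

# Hand-picked illustrative weights; the same W_xh, W_hh, b are reused
# at every time step -- that weight sharing is what makes it recurrent.
W_xh = [[1.0, 0.0], [0.0, 1.0]]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]
h = [0.0, 0.0]
for x in [[1.0, 0.0], [0.0, 1.0]]:  # a sequence of two inputs
    h = rnn_step(x, h, W_xh, W_hh, b)
```

After the loop, h carries a trace of both inputs: the first input still influences the state through the W_hh term.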

2. Why are RNNs useful for NLP tasks?

Answer: Many NLP tasks involve sequences of tokens where word order matters; RNNs can, in principle, model dependencies across positions, making them suitable for language modeling, tagging and sequence generation.

3. What is the vanishing gradient problem in RNNs?

Answer: During backpropagation through time, gradients can shrink exponentially as they pass through many time steps, making it difficult for vanilla RNNs to learn long-range dependencies in sequences.
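
A scalar toy model makes the exponential shrinkage concrete (illustrative numbers, not from any real training run). Each backward step multiplies the gradient by w · (1 − h_t²), which is below 1 whenever |w| < 1 or the tanh saturates:

```python
import math

def gradient_through_time(w, xs):
    """Scalar RNN h_t = tanh(w*h_{t-1} + x_t); returns |dh_T / dh_0|.

    Each step multiplies the gradient by w * (1 - h_t^2), so with
    |w| < 1 (or a saturated tanh) the product shrinks exponentially
    in the number of time steps.
    """
    h, grad = 0.0, 1.0
    for x in xs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule through one step
    return abs(grad)

short = gradient_through_time(0.9, [0.1] * 5)   # gradient over 5 steps
long_ = gradient_through_time(0.9, [0.1] * 50)  # gradient over 50 steps
```

The 50-step gradient comes out many orders of magnitude smaller than the 5-step one, which is exactly why vanilla RNNs struggle to assign credit across long spans.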

4. How do LSTMs address vanishing gradients?

Answer: LSTMs introduce a cell state and gating mechanisms (input, forget and output gates) that regulate information flow and maintain gradients more effectively over long time spans, enabling learning of longer dependencies.
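
A scalar sketch of one LSTM step shows how the gates act. The parameter names (wi, ui, bi, …) are made up for this illustration; the equations are the standard LSTM cell:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p holds weights/biases (illustrative names)."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate
    c = f * c_prev + i * g   # additive update: gradients flow through f
    h = o * math.tanh(c)
    return h, c

# With the forget gate saturated open and the input gate shut,
# the cell state is carried through almost unchanged across steps.
p = {k: 0.0 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wg", "ug", "bg")}
p["bf"], p["bi"] = 10.0, -10.0
h, c = 0.0, 2.0
for x in (1.0, -1.0, 0.5):
    h, c = lstm_step(x, h, c, p)
```

The additive c = f·c_prev + i·g update is the key: when f ≈ 1, the backward pass multiplies by a factor near 1 instead of a shrinking tanh derivative.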

5. What is a GRU and how does it differ from an LSTM?

Answer: A GRU is a simplified gated RNN that combines the forget and input gates into an update gate and uses a reset gate, offering similar performance to LSTMs with fewer parameters and a simpler structure.
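
The same scalar-sketch style for a GRU step (parameter names again made up for illustration). Note there is no separate cell state; the update gate z interpolates directly between the old state and the candidate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One scalar GRU step; p holds weights/biases (illustrative names)."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_tilde  # interpolate old and new

# Driving the update gate toward 0 makes the GRU copy its state forward,
# analogous to an LSTM with an open forget gate.
p = {k: 0.0 for k in ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")}
p["bz"] = -10.0
h = gru_step(1.0, 0.7, p)
```

Counting the lines makes the parameter saving visible: three gated components here versus four in the LSTM sketch, and no separate cell state to carry.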

6. What is a bidirectional RNN and why is it useful?

Answer: A bidirectional RNN processes the sequence in both forward and backward directions and concatenates the hidden states, allowing models to use both past and future context for each position in tasks like tagging or classification.
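
A minimal sketch of the idea, using a scalar RNN for each direction (toy weights, no library API assumed): run one pass left-to-right, one right-to-left, and pair the states per position:

```python
import math

def run_rnn(xs, w=0.5):
    """Scalar RNN pass: returns the hidden state after each position."""
    h, hs = 0.0, []
    for x in xs:
        h = math.tanh(w * h + x)
        hs.append(h)
    return hs

def birnn_states(xs):
    """Concatenate forward and backward states per position (sketch)."""
    fwd = run_rnn(xs)
    bwd = list(reversed(run_rnn(list(reversed(xs)))))
    return list(zip(fwd, bwd))  # (past context, future context) per token

states = birnn_states([0.1, 0.2, 0.3])
```

Each position now carries a summary of everything to its left and everything to its right, which is why bidirectional encoders suit tagging but not autoregressive generation (the backward pass would peek at future tokens).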

7. How are RNNs used in language modeling?

Answer: In RNN language models, the network reads tokens sequentially and predicts the next token at each step based on the current hidden state, learning conditional distributions over words given preceding context.

8. What is teacher forcing in RNN training?

Answer: Teacher forcing feeds the ground-truth token as input at each time step during training rather than the model’s own prediction, which stabilizes and speeds up learning but creates exposure bias: at test time the model must consume its own, possibly erroneous, predictions, a mismatch it never encountered during training.
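
The difference is a one-line branch in the decoding loop. This sketch uses a deliberately biased stand-in for the model (a made-up step_fn, not any real decoder) to show how errors compound without teacher forcing:

```python
def decode_train_step(targets, step_fn, start_token, teacher_forcing=True):
    """One decoding pass; step_fn(prev_token, state) -> (prediction, state).

    With teacher forcing, the ground-truth token is fed at each step;
    without it, the model consumes its own previous prediction.
    """
    state, prev, preds = 0.0, start_token, []
    for gold in targets:
        pred, state = step_fn(prev, state)
        preds.append(pred)
        prev = gold if teacher_forcing else pred  # the key difference
    return preds

# A toy "model" that always predicts input + 2: with teacher forcing its
# per-step error stays bounded; feeding its own outputs lets errors compound.
step_fn = lambda tok, state: (tok + 2, state)
tf = decode_train_step([1, 2, 3], step_fn, 0, teacher_forcing=True)   # [2, 3, 4]
fr = decode_train_step([1, 2, 3], step_fn, 0, teacher_forcing=False)  # [2, 4, 6]
```

The free-running predictions drift further from the targets at every step, which is exactly the train/test mismatch that exposure bias names.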

9. How do RNNs compare to transformers for NLP?

Answer: RNNs process tokens sequentially and struggle with very long contexts, while transformers use self-attention to model all pairwise interactions in parallel, generally outperforming RNNs on large-scale NLP benchmarks today.

10. Do RNNs still have a role in modern NLP?

Answer: Although transformers dominate, RNNs are still used in lightweight or low-latency settings, embedded applications and as baselines or educational models for understanding sequence learning concepts.

11. What is gradient clipping and why is it used with RNNs?

Answer: Gradient clipping limits the norm of gradients during backpropagation to prevent exploding gradients, which can occur in deep or long unrolled RNNs and destabilize training.

12. How do stacked RNNs increase model capacity?

Answer: Stacked or multi-layer RNNs feed the hidden sequence of one RNN layer into another, allowing the network to learn more abstract sequence representations at higher layers, similar to deep feed-forward networks.
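
Stacking is just function composition over sequences. A scalar sketch (toy weights, one scalar per layer for simplicity):

```python
import math

def run_layer(xs, w):
    """One scalar RNN layer: returns the hidden state at every position."""
    h, out = 0.0, []
    for x in xs:
        h = math.tanh(w * h + x)
        out.append(h)
    return out

def stacked_rnn(xs, layer_weights):
    """Each layer consumes the full hidden sequence of the layer below."""
    seq = xs
    for w in layer_weights:
        seq = run_layer(seq, w)
    return seq

out = stacked_rnn([0.5, -0.5], [0.5, 0.5])  # two stacked layers
```

The second layer sees a sequence of learned features rather than raw inputs, which is where the extra abstraction comes from.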

13. What is truncated backpropagation through time (BPTT)?

Answer: Truncated BPTT limits how many time steps the gradient is propagated backward, reducing memory and computation cost at the expense of only partially capturing long-range dependencies.
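
A scalar sketch of the mechanics (illustrative, not any framework's API): the sequence is cut into chunks, the hidden state is carried forward across chunk boundaries, but the gradient accumulator is reset there, which is what a "detach" does in autodiff frameworks:

```python
import math

def tbptt_chunks(xs, k):
    """Split a sequence into consecutive chunks of length k."""
    return [xs[i:i + k] for i in range(0, len(xs), k)]

def run_with_tbptt(xs, w, k):
    """Scalar RNN h = tanh(w*h + x) trained with truncated BPTT.

    dh/dw is accumulated only within each chunk; at a chunk boundary the
    carried state is treated as a constant (the 'detach'), so grad_h
    resets to zero and no gradient flows across the boundary.
    """
    h, per_chunk_grads = 0.0, []
    for chunk in tbptt_chunks(xs, k):
        grad_h = 0.0  # detach: gradient does not flow into this chunk
        for x in chunk:
            new_h = math.tanh(w * h + x)
            # chain rule: dh_t/dw = (1 - h_t^2) * (h_{t-1} + w * dh_{t-1}/dw)
            grad_h = (1.0 - new_h * new_h) * (h + w * grad_h)
            h = new_h
        per_chunk_grads.append(grad_h)
    return h, per_chunk_grads

final_h, grads = run_with_tbptt([0.1] * 6, 0.5, 3)
```

Memory and compute now scale with the chunk length k rather than the full sequence length, at the cost of credit assignment that stops at chunk boundaries.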

14. How are RNNs used for sequence labeling tasks like POS tagging?

Answer: In sequence labeling, bidirectional RNNs produce contextual hidden states for each token, which are then passed to a classifier or CRF layer to assign labels such as POS tags or NER categories.

15. What are some drawbacks of RNNs for long sequences?

Answer: RNNs process tokens sequentially, limiting parallelism and making training slow; they also struggle with very long-range dependencies despite gating and are harder to scale than attention-based models.

16. How did RNNs contribute to early neural MT systems?

Answer: Early neural machine translation used encoder–decoder architectures with RNNs (often LSTMs or GRUs) and attention, significantly outperforming phrase-based SMT before transformers became dominant.

17. What types of regularization are used with RNNs?

Answer: Techniques include dropout on inputs or between RNN layers, recurrent dropout on hidden-to-hidden connections, L2 weight decay and early stopping to prevent overfitting on sequence data.

18. What is the difference between many-to-one and many-to-many RNN setups?

Answer: Many-to-one RNNs map an input sequence to a single output (e.g. sentiment classification), while many-to-many RNNs produce an output at each time step (e.g. tagging, translation or speech recognition).
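
The two setups share the same recurrence and differ only in what is read out. A scalar sketch with a toy readout (illustrative weights, not a trained model):

```python
import math

def rnn_states(xs, w=0.5):
    """Scalar RNN pass returning the hidden state at every position."""
    h, hs = 0.0, []
    for x in xs:
        h = math.tanh(w * h + x)
        hs.append(h)
    return hs

def many_to_one(xs):
    """Single output read from the final state (e.g. classification)."""
    return rnn_states(xs)[-1]

def many_to_many(xs):
    """One output per time step (e.g. tagging); toy linear readout."""
    return [2.0 * h for h in rnn_states(xs)]
```

Encoder–decoder translation is a third pattern built from these two: a many-to-one encoder summarizes the source, and a decoder emits many outputs from that summary.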

19. Why is weight initialization important in RNNs?

Answer: Poor initialization can exacerbate vanishing or exploding gradients; careful schemes such as orthogonal initialization for recurrent weights help stabilize training dynamics in RNNs.

20. What should you know about RNNs for interviews today?

Answer: You should understand basic RNN equations, vanishing gradients, LSTM/GRU mechanisms, typical NLP applications and why transformers largely replaced RNNs in state-of-the-art language models.

RNN concepts covered

This page covers RNNs for NLP: vanilla recurrent networks, vanishing gradients, LSTM and GRU variants, bidirectional and stacked RNNs, language modeling and how RNNs compare to transformers in modern NLP.

RNN basics & BPTT
LSTM & GRU gating
Sequence labeling & LM
Regularization & clipping
RNN vs transformers
Interview essentials