NLP Tutorial

Recurrent Neural Networks

RNN, LSTM, GRU, bidirectional RNNs, and Seq2Seq architectures for NLP.

RNN for NLP

Recurrent Neural Networks (RNN)

Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed specifically for processing sequential data, such as time series, audio, or natural language text.

How RNNs Differ from Feed-Forward Nets

Feed Forward (Standard DNN)

A standard neural network processes the entire input all at once and has a fixed input size. It has no concept of "memory" or order.

Recurrent (RNN)

RNNs process sequences step-by-step. They maintain a Hidden State (a memory vector) that gets continually updated as it reads through the sentence word by word.
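
To make the hidden-state idea concrete, here is a minimal NumPy sketch of a vanilla RNN reading a sequence one step at a time. The sizes, the random weights, and the names `rnn_forward`, `W_x`, `W_h`, and `b` are illustrative assumptions, not part of any library.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    h = np.zeros(W_h.shape[0])         # initial hidden state (the "memory")
    states = []
    for x_t in inputs:                 # one time step per word vector
        # The new memory depends on the current word AND the previous memory
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 8, 4            # illustrative sizes
sentence = rng.normal(size=(seq_len, input_dim))    # stand-in for 5 word embeddings
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

states = rnn_forward(sentence, W_x, W_h, b)
print(states.shape)   # one hidden state per word
```

The same weights are reused at every step, which is why the network can handle sentences of any length.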

Level 1 — Building an RNN in Keras

RNNs excel at sequence classification (like sentiment analysis). Here, we process words sequentially to determine if a review is positive or negative.

Vanilla text classification RNN
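from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, SimpleRNN, Dense, Embedding

vocab_size = 10000
embedding_dim = 32
max_sequence_length = 100

model = Sequential([
    # Each sample is a padded sequence of max_sequence_length word indices.
    # Declaring the shape here replaces Embedding's input_length argument,
    # which recent Keras versions have deprecated.
    Input(shape=(max_sequence_length,)),
    # Turn positive integers (word indices) into dense vectors of fixed size
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),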
    
    # Vanilla RNN layer: maintains a 64-dimension hidden state across time steps
    SimpleRNN(64, return_sequences=False),
    
    # Binary classification output layer
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

Level 2 — Bidirectional RNNs

A standard RNN only knows what happened before the current word. A Bidirectional RNN runs two RNNs over the sentence, one reading forwards and one reading backwards, and concatenates their hidden states. Each position therefore gets context from both sides of the sentence.

Bidirectional RNN setup
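from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, SimpleRNN, Dense, Embedding

# vocab_size and embedding_dim are reused from the previous example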

bi_model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    # By wrapping the RNN in Bidirectional, Keras automatically handles 
    # the forward and backward passes and combines them.
    Bidirectional(SimpleRNN(64)),
    Dense(1, activation='sigmoid')
])

The Core Flaw: The Vanishing Gradient Problem

Why don't we rely on vanilla RNNs alone? During backpropagation through a very long sequence (like a paragraph), the gradient is multiplied by one factor per time step. When those factors are smaller than 1, the product shrinks exponentially toward 0 (the Vanishing Gradient), so the earliest words receive almost no learning signal. Result: a vanilla RNN suffers from short-term memory and has effectively forgotten the beginning of a sentence by the time it reaches the end. This led to the creation of LSTMs!
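
The arithmetic behind this can be sketched in a few lines of Python. This is a deliberate simplification: it assumes the gradient is scaled by the same factor at every time step, whereas a real RNN multiplies by a different matrix at each step. The helper name `gradient_scale` is hypothetical.

```python
def gradient_scale(factor, steps):
    """Product of `steps` identical per-step factors (simplified backprop through time)."""
    return factor ** steps

# How much of the gradient survives after 10, 50, and 100 time steps?
for steps in (10, 50, 100):
    print(steps, gradient_scale(0.9, steps), gradient_scale(0.5, steps))
```

Even a mild factor of 0.9 leaves almost nothing of the gradient after 100 steps, which is roughly the length of a short paragraph.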

LSTM Networks

Long Short-Term Memory (LSTM) Networks

LSTMs are a gated variant of RNNs, engineered to mitigate the Vanishing Gradient problem and remember long-term dependencies.

The Architecture of an LSTM Cell

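While an RNN has a single neural net layer (like tanh) inside its repeating module, an LSTM cell contains four interacting layers: three sigmoid "gates" that control the flow of information, plus a tanh layer that proposes new candidate values. Together they read from and write to a separate cell state.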

  • 1. Cell State (The Conveyor Belt): The core memory straight down the middle of the cell. Information flows along it smoothly with minor linear interactions.
  • 2. Forget Gate: Decides what information from the past memory we should throw away or forget (outputs between 0 and 1).
  • 3. Input Gate: Decides what new information from the current word we should add to our memory.
  • 4. Output Gate: Decides what part of the memory we should output as our hidden state for this time step.
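
The four parts listed above can be sketched as a single LSTM time step in NumPy. This follows the standard LSTM formulation; the function name `lstm_step`, the stacked weight layout `W`, and the random weights are assumptions made for illustration and do not mirror how Keras stores its weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x_t] to the four stacked layers."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:n])            # forget gate: what to erase from the old memory
    i = sigmoid(z[n:2 * n])        # input gate: how much new information to write
    g = np.tanh(z[2 * n:3 * n])    # candidate values proposed for the memory
    o = sigmoid(z[3 * n:4 * n])    # output gate: what part of the memory to expose
    c_t = f * c_prev + i * g       # cell state update (the "conveyor belt")
    h_t = o * np.tanh(c_t)         # hidden state for this time step
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4       # illustrative sizes
W = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim + input_dim))
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):   # stand-in for 6 word vectors
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

Note that the cell state is only scaled by `f` and added to, never squashed through a weight matrix, which is what lets information and gradients travel across many steps.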

Level 1 — Implementing LSTMs for NLP

Because Keras and PyTorch hide the complex gate math, implementing an LSTM is as simple as importing the layer.

LSTM Model in Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout

model = Sequential([
    Embedding(input_dim=20000, output_dim=128),
    # The LSTM layer replaces the SimpleRNN layer. 
    # It has 128 internal units managing the Cell State and Gates.
    LSTM(128, return_sequences=True), 
    Dropout(0.2), # Dropout helps prevent overfitting
    
    # We can stack LSTMs by returning sequences from the first one
    LSTM(64),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Level 2 — Uses of LSTMs

For roughly 5 years (2013-2018), LSTMs were the undisputed kings of NLP before Transformers took over. They were used for:

Machine Translation

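Google's 2016 Neural Machine Translation system (GNMT) used deep stacks of LSTMs: an 8-layer encoder whose first layer was bidirectional, and an 8-layer decoder, connected by attention.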

Text Generation

Predicting the next character or word (like predictive keyboards).

Speech Recognition

Converting audio sequences to text transcriptions.

GRUs (Gated Recurrent Units) are a popular variant of LSTMs. They combine the forget and input gates into a single "update gate" and merge the cell state and hidden state, making them computationally cheaper and faster to train while offering similar performance.

Gated Recurrent Units (GRU)

The GRU (Gated Recurrent Unit) is a newer, simplified alternative to the LSTM, introduced by Kyunghyun Cho et al. in 2014. Like the LSTM, it mitigates the vanishing gradient problem, but it uses fewer parameters, which usually makes it faster to train.

Simplified Architecture

A GRU combines the LSTM's forget and input gates into a single Update Gate. It also merges the cell state and hidden state, leaving only two gates total: Reset and Update.
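
A single GRU time step can be sketched in NumPy as follows. The function name `gru_step` and the weight names are illustrative assumptions. Note also that sources disagree on which way round the update gate is written; this sketch uses the convention where `z` close to 0 keeps the old state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU time step: two gates and a single state vector."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx + b_z)    # update gate: how much of the state to overwrite
    r = sigmoid(W_r @ hx + b_r)    # reset gate: how much past state feeds the candidate
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    # One gate handles both "forget" (1 - z) and "write" (z); there is no separate cell state
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4       # illustrative sizes
shape = (hidden_dim, hidden_dim + input_dim)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))
b_z, b_r, b_h = (np.zeros(hidden_dim) for _ in range(3))

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):   # stand-in for 6 word vectors
    h = gru_step(x_t, h, W_z, W_r, W_h, b_z, b_r, b_h)
print(h.shape)
```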

Faster Training

Fewer tensor operations mean GRUs typically train faster and use less memory than LSTMs, often achieving comparable performance, particularly on smaller datasets.
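
The size difference is easy to check with the textbook parameter counts: an LSTM has four weight blocks, a GRU has three. The helper names below are hypothetical, and the formulas assume one bias vector per block. Keras's default GRU keeps an extra set of biases, so `model.summary()` may report a slightly larger number for the GRU.

```python
def lstm_params(input_dim, units):
    """Four blocks (forget, input, candidate, output), each with weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    """Three blocks (update, reset, candidate), each with weights and a bias."""
    return 3 * (units * (input_dim + units) + units)

# 128-dimensional embeddings feeding 128 recurrent units
print(lstm_params(128, 128))   # 131584
print(gru_params(128, 128))    # 98688
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same width.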

Level 1 — GRU Implementation in Keras

Using a GRU layer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense, Embedding

model = Sequential([
    Embedding(input_dim=10000, output_dim=128),
    # Swap out 'LSTM' for 'GRU'. GRUs are great right out of the box!
    GRU(128, return_sequences=False),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Bidirectional RNNs

Standard RNNs read text from left-to-right. A Bidirectional RNN (BiRNN) reads text simultaneously forward and backward, allowing the network to understand the context of a word using both the words that came before it and after it.

Why Bi-Directional Context Matters

Consider these two sentences:

"He said, 'Teddy bears are on sale!'" vs "He said, 'Teddy Roosevelt was a president.'"

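A forward-only RNN reading "Teddy" cannot yet tell whether it refers to a toy or a person, because the deciding word has not been read. A BiRNN's backward pass has already seen "bears" or "Roosevelt", so its representation of "Teddy" can reflect the right meaning. The trade-off is that the whole sentence must be available before processing starts.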
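
This effect can be sketched with two tiny NumPy RNNs. The word vectors and weights below are random stand-ins (nothing is trained), and the names `run_rnn` and `run_birnn` are hypothetical. The two sequences are identical except for the word that follows "Teddy".

```python
import numpy as np

def run_rnn(inputs, W_x, W_h):
    """Vanilla RNN returning the hidden state at every position."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)

def run_birnn(inputs, fwd, bwd):
    """Concatenate a left-to-right pass with a right-to-left pass."""
    forward = run_rnn(inputs, *fwd)
    backward = run_rnn(inputs[::-1], *bwd)[::-1]   # flip back to the original word order
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 5, 3                       # illustrative sizes

def make_weights():
    return (rng.normal(scale=0.5, size=(hidden_dim, embed_dim)),
            rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)))

fwd, bwd = make_weights(), make_weights()

# Stand-ins for "He said Teddy bears" and "He said Teddy Roosevelt"
sentence_a = rng.normal(size=(4, embed_dim))
sentence_b = sentence_a.copy()
sentence_b[3] = rng.normal(size=embed_dim)         # only the word after "Teddy" differs

out_a, out_b = run_birnn(sentence_a, fwd, bwd), run_birnn(sentence_b, fwd, bwd)
teddy = 2                                          # position of "Teddy"
# Forward half at "Teddy" is the same in both sentences; the backward half is not
print(np.allclose(out_a[teddy, :hidden_dim], out_b[teddy, :hidden_dim]))
print(np.allclose(out_a[teddy, hidden_dim:], out_b[teddy, hidden_dim:]))
```

Only the backward half of the representation at "Teddy" reacts to the following word, which is the extra context a forward-only RNN lacks.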

Level 1 — Bi-LSTM in Keras

Wrapping an LSTM in Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense, Embedding

model = Sequential([
    Embedding(input_dim=5000, output_dim=64),
    # The Bidirectional wrapper duplicates the layer: one forward, one backward.
    # The outputs are concatenated, so 64 units x 2 = 128 dimension output.
    Bidirectional(LSTM(64)),
    Dense(1, activation='sigmoid')
])

Seq2Seq Models

Sequence-to-Sequence (Seq2Seq)

The Seq2Seq (Encoder-Decoder) architecture maps an input sequence (like an English sentence) to an output sequence whose length can differ from the input's (like its French translation). It is the backbone of RNN-based Machine Translation and Summarization.

The Two Components

  • 1. Encoder: An RNN (usually an LSTM) that reads the input sequence step by step and compresses its entirety into a single fixed-size vector called the Context Vector.
  • 2. Decoder: A second RNN that takes this Context Vector as its initial state, and generates the output sequence one token at a time until it produces an [END] token.

The Bottleneck Problem: In a vanilla Seq2Seq model, all the information in a long sentence (say, 50 words) must be squeezed into one fixed-size vector before decoding. This "information bottleneck" degrades quality for long sentences. (The solution is Attention!)
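
The encoder, the decoder, and the fixed-size context vector can be sketched with plain NumPy RNNs. Everything here is an assumption made for illustration: the toy vocabulary, the sizes, the helper names, and the random (untrained) weights, so the generated tokens are arbitrary. A real system would use trained LSTM layers.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 6, 8                                 # illustrative sizes
target_vocab = ["[START]", "[END]", "le", "chat", "dort"]    # hypothetical toy vocabulary
START, END = 0, 1

def rnn_step(x_t, h, W_x, W_h):
    return np.tanh(W_x @ x_t + W_h @ h)

def encode(source_vectors, W_x, W_h):
    """Read the whole input and keep only the final hidden state."""
    h = np.zeros(hidden_dim)
    for x_t in source_vectors:
        h = rnn_step(x_t, h, W_x, W_h)
    return h                                     # the context vector

def decode(context, embeddings, W_x, W_h, W_out, max_len=10):
    """Greedy decoding: feed each predicted token back in until [END]."""
    h, token, output = context, START, []
    for _ in range(max_len):                     # safety cap in case [END] never appears
        h = rnn_step(embeddings[token], h, W_x, W_h)
        token = int(np.argmax(W_out @ h))        # highest-scoring next token
        if token == END:
            break
        output.append(target_vocab[token])
    return output

def init(*shape):
    return rng.normal(scale=0.5, size=shape)

enc_W_x, enc_W_h = init(hidden_dim, embed_dim), init(hidden_dim, hidden_dim)
dec_W_x, dec_W_h = init(hidden_dim, embed_dim), init(hidden_dim, hidden_dim)
dec_embeddings = init(len(target_vocab), embed_dim)
W_out = init(len(target_vocab), hidden_dim)

short_sentence = init(3, embed_dim)              # stand-in for 3 source word vectors
long_sentence = init(50, embed_dim)              # stand-in for 50 source word vectors

context_short = encode(short_sentence, enc_W_x, enc_W_h)
context_long = encode(long_sentence, enc_W_x, enc_W_h)
print(context_short.shape, context_long.shape)   # same size for 3 or 50 words

generated = decode(context_long, dec_embeddings, dec_W_x, dec_W_h, W_out)
print(generated)
```

The context vector has the same size whether the source has 3 words or 50, which is the bottleneck described above.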