
NLP with Deep Learning: Teaching Machines to Understand Language

Natural Language Processing (NLP) has been revolutionized by deep learning. This guide traces the path from word vectors to the pretrained Transformers that power ChatGPT, covering architectures, embeddings, sequence models, attention, and state-of-the-art language models.

  • 2013: Word2Vec (static embeddings)
  • 2014: Seq2Seq + attention
  • 2017: Transformer (self-attention)
  • 2018: BERT (pretraining)
  • 2020+: GPT-3/4 (LLMs)
  • Now: Multi-modal (text + image)

Why Deep Learning for NLP?

Traditional NLP relied on hand-crafted features (POS tags, parse trees, lexicons). Deep learning enables end-to-end systems that learn hierarchical representations directly from raw text, capturing syntax, semantics, and world knowledge.

Classical NLP

Feature engineering, sparse representations, separate components (tokenizer → tagger → parser → classifier). Brittle pipelines.

Deep Learning NLP

Learned dense embeddings, hierarchical feature extraction, joint training, transfer learning. One model for multiple tasks.

Raw Text → Tokenization → Embedding → Encoder (RNN/Transformer) → Task Head → Output

All components differentiable, trained end-to-end.

From Raw Text to Tokens

Tokenization Strategies
  • Word-level: Split by space/punctuation. Large vocab (50k+). OOV problem.
  • Character-level: No OOV, long sequences.
  • Subword (BPE, WordPiece, Unigram): balance between word- and character-level. Used in BERT (WordPiece) and GPT (BPE).
Subword Tokenization (Hugging Face)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("I love NLP with deep learning!")
# ['i', 'love', 'nl', '##p', 'with', 'deep', 'learning', '!']
ids = tokenizer.convert_tokens_to_ids(tokens)
Byte-Pair Encoding (BPE): start from characters and iteratively merge the most frequent adjacent symbol pair, e.g. "low" + "est" → "lowest".
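The merge loop described above can be sketched in a few lines. This is a toy illustration on a hypothetical three-word corpus, not a production tokenizer:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus (word -> frequency)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, words):
    """Replace every occurrence of the pair with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word as a tuple of characters, with its count.
corpus = {("l", "o", "w"): 5,
          ("l", "o", "w", "e", "r"): 2,
          ("l", "o", "w", "e", "s", "t"): 3}
for _ in range(3):
    corpus = merge_pair(most_frequent_pair(corpus), corpus)
# After 3 merges, "low" has become a single symbol.
```

Real BPE implementations also record the learned merge rules so the same merges can be replayed on new text at inference time.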

Word Embeddings: You Shall Know a Word by Its Context

Word2Vec (2013)

CBOW: Predict word from context.
Skip-gram: Predict context from word.

maximize log p(context | word)

GloVe (2014)

Global Vectors. Factorizes word co-occurrence matrix. Combines count-based & prediction-based.

FastText (2016)

Subword information. Each word as bag of character n-grams. Handles OOV, morphology.
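The character n-gram decomposition FastText relies on can be sketched as follows (boundary markers `<` and `>` are part of the original scheme; the function name is illustrative):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style: wrap the word in boundary markers, then slide windows."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [w[i:i + n] for i in range(len(w) - n + 1)]
    return grams

# A word's vector is the sum of its n-gram vectors, so unseen (OOV)
# words and rare morphological variants still get a representation.
grams = char_ngrams("where")
```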

Word2Vec Skip-gram (PyTorch-style)
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.target_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, target, context):
        # target: (batch,), context: (batch,)
        v = self.target_embeddings(target)       # (batch, emb)
        u = self.context_embeddings(context)     # (batch, emb)
        score = torch.sum(v * u, dim=1)          # (batch,)
        return score  # use with BCEWithLogitsLoss (negative sampling)
King − Man + Woman ≈ Queen: embeddings capture analogies via vector arithmetic.
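The analogy arithmetic can be demonstrated with toy vectors (these 3-d embeddings are hypothetical; real Word2Vec vectors have 100-300 dimensions):

```python
import numpy as np

# Hypothetical toy embeddings chosen so the analogy holds.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.2, 0.1]),
    "woman": np.array([0.5, 0.2, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.0, 0.1, 0.2]),
}

def nearest(vec, emb, exclude=()):
    """Word with the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(vec, emb[w]))

query = emb["king"] - emb["man"] + emb["woman"]
answer = nearest(query, emb, exclude={"king", "man", "woman"})  # "queen"
```

Excluding the query words is standard practice, since the nearest neighbor of the raw result is often one of the inputs.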

Recurrent Neural Networks (RNNs, LSTMs, GRUs)

Process text sequentially, maintaining hidden state. Natural fit for variable-length sequences.

RNN

hₜ = tanh(Wᵢₕ xₜ + bᵢₕ + Wₕₕ hₜ₋₁ + bₕₕ).
Vanishing gradients on long sequences.
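The recurrence above can be written out directly; repeated multiplication by W_hh inside tanh is exactly what makes gradients vanish over long sequences (sizes here are arbitrary):

```python
import torch

input_dim, hidden_dim = 4, 8
W_ih = torch.randn(hidden_dim, input_dim) * 0.1   # input-to-hidden weights
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden weights
b = torch.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_ih x_t + W_hh h_{t-1} + b)
    return torch.tanh(W_ih @ x_t + W_hh @ h_prev + b)

h = torch.zeros(hidden_dim)
for x_t in torch.randn(10, input_dim):  # a sequence of 10 token vectors
    h = rnn_step(x_t, h)                # same weights reused at every step
```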

LSTM (1997)

Forget gate, input gate, output gate, cell state. Mitigates vanishing gradients. Long-range dependencies.

GRU (2014)

Update gate, reset gate. Fewer parameters than LSTM. Comparable performance.

LSTM for Text Classification (PyTorch)
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)       # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)  # h_n: (num_layers, batch, hidden_dim)
        out = h_n[-1]               # last layer's final hidden state
        return self.fc(out)

Seq2Seq & Attention: The Breakthrough

Encoder-decoder architecture for translation, summarization, chatbots. Attention solves the bottleneck.

Seq2Seq without Attention

Encoder final state → entire sentence representation. Decoder generates. Information loss for long sentences.

Seq2Seq + Attention

Decoder computes context vector as weighted sum of encoder states. Dynamic alignment.
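The weighted sum described above can be sketched with dot-product (Luong-style) scores; the shapes below are arbitrary:

```python
import torch
import torch.nn.functional as F

encoder_states = torch.randn(5, 8)   # (src_len, hidden): one state per source token
decoder_state  = torch.randn(8)      # current decoder hidden state

scores = encoder_states @ decoder_state   # (src_len,): one score per source position
weights = F.softmax(scores, dim=0)        # attention distribution, sums to 1
context = weights @ encoder_states        # weighted sum of encoder states, (hidden,)
```

The context vector is recomputed at every decoding step, which is the "dynamic alignment": each target token can attend to different source positions.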

Bahdanau (additive) and Luong (multiplicative) attention

2014-2015: Attention mechanism enables state-of-the-art MT. Google Neural Machine Translation adopts Seq2Seq+Attention.

Transformers: The Architecture That Changed Everything

No recurrence. Pure self-attention. Parallelizable, scalable, contextual embeddings.

Self-Attention

Each token attends to all tokens. Captures long-range context.

Multi-Head

Multiple attention views: syntax, semantics, coreference.

Positional Encoding

Injects sequence order information.
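A common choice is the sinusoidal encoding from the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(...), sketched here:

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encoding, added to token embeddings."""
    pos = torch.arange(max_len).unsqueeze(1).float()          # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))         # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe

pe = sinusoidal_pe(100, 64)  # one row per position
```

Learned position embeddings (as in BERT and GPT) are an equally common alternative.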

Transformer Block:
Input → [Multi-Head Self-Attention] → Add & Norm → [Feed-Forward] → Add & Norm → Output
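The block diagrammed above can be sketched with PyTorch's built-in multi-head attention (post-norm variant, as in the original paper; sizes are illustrative):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Multi-Head Self-Attention -> Add & Norm -> Feed-Forward -> Add & Norm."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x)        # self-attention: Q = K = V = x
        x = self.norm1(x + a)            # residual + Add & Norm
        x = self.norm2(x + self.ff(x))   # feed-forward + Add & Norm
        return x

x = torch.randn(2, 10, 64)               # (batch, seq_len, d_model)
out = TransformerBlock()(x)              # same shape in, same shape out
```

Because no step depends on the previous token's output, all positions are processed in parallel, unlike an RNN.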

Pretrained Language Models: The Foundation of Modern NLP

Train on massive unlabeled text, then fine-tune on downstream tasks. Transfer learning for NLP.

BERT (2018)

Bidirectional Encoder. Masked LM + Next Sentence Prediction. Deeply contextual. SOTA on 11 tasks at release.

Encoder-only Understanding

GPT (2018-...)

Autoregressive Decoder. Predict next token. GPT-3 (175B): few-shot, in-context learning. ChatGPT: instruction-tuned.

Decoder-only Generation

T5

Text-to-Text framework. Encoder-decoder.

RoBERTa

BERT with better hyperparameters, more data.

ALBERT

Parameter-efficient via factorization.

Fine-tuning BERT with Hugging Face
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize dataset
encodings = tokenizer(texts, truncation=True, padding=True)

# Trainer API
training_args = TrainingArguments(output_dir="./results", evaluation_strategy="epoch")
trainer = Trainer(model=model, args=training_args, 
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()

NLP Tasks & Model Heads

Text Classification

Sentiment, spam, topic. [CLS] token + FFN.

NER

Token classification. Linear layer on each token.

QA

Span extraction (BERT: start/end logits).

Translation

Seq2Seq (T5, Marian).

Summarization

BART, Pegasus.

Text Generation

GPT, Llama.

Semantic Similarity

Sentence-BERT.

Zero-shot

Natural language inference.
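The span-extraction head mentioned above (BERT-style QA) is just a per-token linear projection to start/end logits; a minimal sketch, with the encoder output mocked by random tensors:

```python
import torch
import torch.nn as nn

hidden = torch.randn(1, 128, 768)   # (batch, seq_len, hidden) from the encoder
qa_head = nn.Linear(768, 2)         # 2 outputs per token: start and end logits

logits = qa_head(hidden)                          # (1, 128, 2)
start_logits, end_logits = logits.split(1, dim=-1)
start = start_logits.squeeze(-1).argmax(dim=1)    # predicted span start index
end = end_logits.squeeze(-1).argmax(dim=1)        # predicted span end index
```

The NER head works the same way, except the linear layer maps each token to entity-label logits instead of start/end scores.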

Production NLP Pipeline

📦 Offline:
  • Data collection & cleaning
  • Tokenization & model selection
  • Fine-tuning & evaluation
  • Model compression (quantization, distillation)
🚀 Online:
  • Serving (ONNX, TensorRT, TorchServe)
  • Latency optimization
  • Monitoring drift
  • CI/CD for models
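The compression step listed under the offline stage can be as simple as post-training dynamic quantization, which stores Linear weights in int8; sketched here on a hypothetical classifier head rather than a full BERT:

```python
import torch
import torch.nn as nn

# Stand-in for a fine-tuned model's classification head.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))

# Dynamic quantization: int8 weights, dequantized on the fly at inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

out = quantized(torch.randn(1, 768))  # same interface, smaller weights
```

Distillation (training a small student on the teacher's outputs) and pruning are complementary techniques; DistilBERT combines distillation with a shallower architecture.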
ONNX Export for BERT
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()  # export in inference mode
dummy_input = torch.randint(0, 1000, (1, 128))  # batch=1, seq_len=128
torch.onnx.export(model, (dummy_input,), "bert.onnx",
                  input_names=["input_ids"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq_len"}})

NLP with Deep Learning – Cheatsheet

  • Word2Vec: static embeddings
  • LSTM: sequential
  • Attention: alignment
  • Transformer: parallel
  • BERT: encoder
  • GPT: decoder
  • T5: encoder-decoder
  • LoRA: PEFT (parameter-efficient fine-tuning)

Deep Learning NLP Models Comparison

Model | Year | Architecture | Pretraining Task | Best For
Word2Vec | 2013 | Shallow NN | CBOW/Skip-gram | Static embeddings
LSTM | 1997/2013 | Recurrent | - | Sequential modeling
Transformer | 2017 | Self-attention | Translation | Parallel sequence processing
BERT | 2018 | Encoder (Transformer) | Masked LM + NSP | Understanding tasks
GPT-3 | 2020 | Decoder (Transformer) | Autoregressive LM | Few-shot generation
T5 | 2019 | Encoder-Decoder | Span corruption | Text-to-text

NLP Pitfalls & Debugging

⚠️ Overfitting small data: Use pretrained models, not from scratch. Freeze embeddings, gradual unfreezing.
⚠️ OOV tokens: Use subword tokenization (BPE, WordPiece). Avoid word-level with small vocab.
✅ Max sequence length: Transformer O(n²). Use efficient attention or truncation. Longformer, BigBird for long docs.
✅ Learning rate: Transformers: use warmup (5-10% steps), AdamW, linear decay. BERT: 2e-5 to 5e-5.
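The warmup-then-linear-decay schedule recommended above can be sketched with `LambdaLR` (step counts are illustrative; `transformers.get_linear_schedule_with_warmup` provides the same shape ready-made):

```python
import torch

model = torch.nn.Linear(10, 2)                       # stand-in for a Transformer
opt = torch.optim.AdamW(model.parameters(), lr=2e-5) # BERT-range peak LR

total_steps, warmup_steps = 1000, 100                # warmup = 10% of steps

def lr_lambda(step):
    """Multiplier on the base LR: linear warmup 0 -> 1, then linear decay -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# call sched.step() after each opt.step()
```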

The Future of NLP

Efficiency

Pruning, quantization, distillation. Smaller models (DistilBERT, TinyBERT).

Multimodal

CLIP, Flamingo, GPT-4V. Text + images + audio.

Agents

Tool use, reasoning, planning (AutoGPT).

Alignment

RLHF, Constitutional AI. Safety.