
NLP with Deep Learning: Teaching Machines to Understand Language

Natural Language Processing (NLP) has been revolutionized by deep learning. This guide traces the path from word vectors to the pretrained Transformers that power ChatGPT, covering architectures, embeddings, sequence models, attention, and state-of-the-art language models.

  • 2013: Word2Vec (static embeddings)
  • 2014: Seq2Seq + attention
  • 2017: Transformer (self-attention)
  • 2018: BERT (pretraining)
  • 2020+: GPT-3/4 (LLMs)
  • Now: Multi-modal (text + image)

Why Deep Learning for NLP?

Traditional NLP relied on hand-crafted features (POS tags, parse trees, lexicons). Deep learning enables end-to-end systems that learn hierarchical representations directly from raw text, capturing syntax, semantics, and world knowledge.

Classical NLP

Feature engineering, sparse representations, separate components (tokenizer → tagger → parser → classifier). Brittle pipelines.

Deep Learning NLP

Learned dense embeddings, hierarchical feature extraction, joint training, transfer learning. One model for multiple tasks.

Raw Text → Tokenization → Embedding → Encoder (RNN/Transformer) → Task Head → Output

All components differentiable, trained end-to-end.

From Raw Text to Tokens

Tokenization Strategies
  • Word-level: Split by space/punctuation. Large vocab (50k+). OOV problem.
  • Character-level: No OOV, long sequences.
  • Subword (BPE, WordPiece, Unigram): balance between word- and character-level. Used in BERT (WordPiece) and GPT (BPE).
Subword Tokenization (Hugging Face)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("I love NLP with deep learning!")
# ['i', 'love', 'nl', '##p', 'with', 'deep', 'learning', '!']
ids = tokenizer.convert_tokens_to_ids(tokens)
Byte-Pair Encoding (BPE): start from characters and iteratively merge the most frequent adjacent symbol pair, e.g. "low" + "est" → "lowest".
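The merge loop described above can be sketched in a few lines. This is a toy illustration on a hypothetical three-word corpus, not a production tokenizer:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus (word -> frequency)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, words):
    """Replace every occurrence of the pair with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word as a tuple of characters, with its count.
corpus = {("l", "o", "w"): 5,
          ("l", "o", "w", "e", "r"): 2,
          ("l", "o", "w", "e", "s", "t"): 3}
for _ in range(3):
    corpus = merge_pair(most_frequent_pair(corpus), corpus)
# After 3 merges, "low" has become a single symbol.
```

Real BPE implementations also record the learned merge rules so the same merges can be replayed on new text at inference time.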

Word Embeddings: You Shall Know a Word by Its Context

Word2Vec (2013)

CBOW: Predict word from context.
Skip-gram: Predict context from word.

maximize log p(context | word)

GloVe (2014)

Global Vectors. Factorizes word co-occurrence matrix. Combines count-based & prediction-based.

FastText (2016)

Subword information. Each word as bag of character n-grams. Handles OOV, morphology.
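The character n-gram decomposition FastText relies on can be sketched as follows (boundary markers `<` and `>` are part of the original scheme; the function name is illustrative):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style: wrap the word in boundary markers, then slide windows."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [w[i:i + n] for i in range(len(w) - n + 1)]
    return grams

# A word's vector is the sum of its n-gram vectors, so unseen (OOV)
# words and rare morphological variants still get a representation.
grams = char_ngrams("where")
```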

Word2Vec Skip-gram (PyTorch-style)
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.target_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, target, context):
        # target: (batch,), context: (batch,)
        v = self.target_embeddings(target)       # (batch, emb)
        u = self.context_embeddings(context)     # (batch, emb)
        score = torch.sum(v * u, dim=1)          # (batch,)
        return score  # use with BCEWithLogitsLoss (negative sampling)
King − Man + Woman ≈ Queen: embeddings capture analogies via vector arithmetic.
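The analogy arithmetic can be demonstrated with toy vectors (these 3-d embeddings are hypothetical; real Word2Vec vectors have 100-300 dimensions):

```python
import numpy as np

# Hypothetical toy embeddings chosen so the analogy holds.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.2, 0.1]),
    "woman": np.array([0.5, 0.2, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.0, 0.1, 0.2]),
}

def nearest(vec, emb, exclude=()):
    """Word with the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(vec, emb[w]))

query = emb["king"] - emb["man"] + emb["woman"]
answer = nearest(query, emb, exclude={"king", "man", "woman"})  # "queen"
```

Excluding the query words is standard practice, since the nearest neighbor of the raw result is often one of the inputs.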

Recurrent Neural Networks (RNNs, LSTMs, GRUs)

Process text sequentially, maintaining hidden state. Natural fit for variable-length sequences.

RNN

hₜ = tanh(Wᵢₕ xₜ + bᵢₕ + Wₕₕ hₜ₋₁ + bₕₕ).
Vanishing gradients on long sequences.
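The recurrence above can be written out directly; repeated multiplication by W_hh inside tanh is exactly what makes gradients vanish over long sequences (sizes here are arbitrary):

```python
import torch

input_dim, hidden_dim = 4, 8
W_ih = torch.randn(hidden_dim, input_dim) * 0.1   # input-to-hidden weights
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden weights
b = torch.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_ih x_t + W_hh h_{t-1} + b)
    return torch.tanh(W_ih @ x_t + W_hh @ h_prev + b)

h = torch.zeros(hidden_dim)
for x_t in torch.randn(10, input_dim):  # a sequence of 10 token vectors
    h = rnn_step(x_t, h)                # same weights reused at every step
```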

LSTM (1997)

Forget gate, input gate, output gate, cell state. Mitigates vanishing gradients. Long-range dependencies.

GRU (2014)

Update gate, reset gate. Fewer parameters than LSTM. Comparable performance.

LSTM for Text Classification (PyTorch)
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)       # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)  # h_n: (num_layers, batch, hidden_dim)
        out = h_n[-1]               # last layer's final hidden state
        return self.fc(out)

Seq2Seq & Attention: The Breakthrough

Encoder-decoder architecture for translation, summarization, chatbots. Attention solves the bottleneck.

Seq2Seq without Attention

Encoder final state → entire sentence representation. Decoder generates. Information loss for long sentences.

Seq2Seq + Attention

Decoder computes context vector as weighted sum of encoder states. Dynamic alignment.
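The weighted sum described above can be sketched with dot-product (Luong-style) scores; the shapes below are arbitrary:

```python
import torch
import torch.nn.functional as F

encoder_states = torch.randn(5, 8)   # (src_len, hidden): one state per source token
decoder_state  = torch.randn(8)      # current decoder hidden state

scores = encoder_states @ decoder_state   # (src_len,): one score per source position
weights = F.softmax(scores, dim=0)        # attention distribution, sums to 1
context = weights @ encoder_states        # weighted sum of encoder states, (hidden,)
```

The context vector is recomputed at every decoding step, which is the "dynamic alignment": each target token can attend to different source positions.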

Bahdanau (additive) and Luong (multiplicative) attention

2014-2015: Attention mechanism enables state-of-the-art MT. Google Neural Machine Translation adopts Seq2Seq+Attention.

Transformers: The Architecture That Changed Everything

No recurrence. Pure self-attention. Parallelizable, scalable, contextual embeddings.

Self-Attention

Each token attends to all tokens. Captures long-range context.

Multi-Head

Multiple attention views: syntax, semantics, coreference.

Positional Encoding

Injects sequence order information.
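A common choice is the sinusoidal encoding from the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(...), sketched here:

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encoding, added to token embeddings."""
    pos = torch.arange(max_len).unsqueeze(1).float()          # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))         # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe

pe = sinusoidal_pe(100, 64)  # one row per position
```

Learned position embeddings (as in BERT and GPT) are an equally common alternative.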

Transformer Block:
Input → [Multi-Head Self-Attention] → Add & Norm → [Feed-Forward] → Add & Norm → Output
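The block diagrammed above can be sketched with PyTorch's built-in multi-head attention (post-norm variant, as in the original paper; sizes are illustrative):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Multi-Head Self-Attention -> Add & Norm -> Feed-Forward -> Add & Norm."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x)        # self-attention: Q = K = V = x
        x = self.norm1(x + a)            # residual + Add & Norm
        x = self.norm2(x + self.ff(x))   # feed-forward + Add & Norm
        return x

x = torch.randn(2, 10, 64)               # (batch, seq_len, d_model)
out = TransformerBlock()(x)              # same shape in, same shape out
```

Because no step depends on the previous token's output, all positions are processed in parallel, unlike an RNN.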

Pretrained Language Models: The Foundation of Modern NLP

Train on massive unlabeled text, then fine-tune on downstream tasks. Transfer learning for NLP.

BERT (2018)

Bidirectional Encoder. Masked LM + Next Sentence Prediction. Deeply contextual. SOTA on 11 tasks at release.

Encoder-only Understanding

GPT (2018-...)

Autoregressive Decoder. Predict next token. GPT-3 (175B): few-shot, in-context learning. ChatGPT: instruction-tuned.

Decoder-only Generation

T5

Text-to-Text framework. Encoder-decoder.

RoBERTa

BERT with better hyperparameters, more data.

ALBERT

Parameter-efficient via factorization.

Fine-tuning BERT with Hugging Face
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize dataset
encodings = tokenizer(texts, truncation=True, padding=True)

# Trainer API
training_args = TrainingArguments(output_dir="./results", evaluation_strategy="epoch")
trainer = Trainer(model=model, args=training_args, 
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()

NLP Tasks & Model Heads

Text Classification

Sentiment, spam, topic. [CLS] token + FFN.

NER

Token classification. Linear layer on each token.

QA

Span extraction (BERT: start/end logits).

Translation

Seq2Seq (T5, Marian).

Summarization

BART, Pegasus.

Text Generation

GPT, Llama.

Semantic Similarity

Sentence-BERT.

Zero-shot

Natural language inference.
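The span-extraction head mentioned above (BERT-style QA) is just a per-token linear projection to start/end logits; a minimal sketch, with the encoder output mocked by random tensors:

```python
import torch
import torch.nn as nn

hidden = torch.randn(1, 128, 768)   # (batch, seq_len, hidden) from the encoder
qa_head = nn.Linear(768, 2)         # 2 outputs per token: start and end logits

logits = qa_head(hidden)                          # (1, 128, 2)
start_logits, end_logits = logits.split(1, dim=-1)
start = start_logits.squeeze(-1).argmax(dim=1)    # predicted span start index
end = end_logits.squeeze(-1).argmax(dim=1)        # predicted span end index
```

The NER head works the same way, except the linear layer maps each token to entity-label logits instead of start/end scores.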

Production NLP Pipeline

📦 Offline:
  • Data collection & cleaning
  • Tokenization & model selection
  • Fine-tuning & evaluation
  • Model compression (quantization, distillation)
🚀 Online:
  • Serving (ONNX, TensorRT, TorchServe)
  • Latency optimization
  • Monitoring drift
  • CI/CD for models
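The compression step listed under the offline stage can be as simple as post-training dynamic quantization, which stores Linear weights in int8; sketched here on a hypothetical classifier head rather than a full BERT:

```python
import torch
import torch.nn as nn

# Stand-in for a fine-tuned model's classification head.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))

# Dynamic quantization: int8 weights, dequantized on the fly at inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

out = quantized(torch.randn(1, 768))  # same interface, smaller weights
```

Distillation (training a small student on the teacher's outputs) and pruning are complementary techniques; DistilBERT combines distillation with a shallower architecture.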
ONNX Export for BERT
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()  # export in inference mode
dummy_input = torch.randint(0, 1000, (1, 128))  # batch=1, seq_len=128
torch.onnx.export(model, (dummy_input,), "bert.onnx",
                  input_names=["input_ids"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq_len"}})

NLP with Deep Learning – Cheatsheet

  • Word2Vec: static embeddings
  • LSTM: sequential
  • Attention: alignment
  • Transformer: parallel
  • BERT: encoder
  • GPT: decoder
  • T5: encoder-decoder
  • LoRA: PEFT (parameter-efficient fine-tuning)

Deep Learning NLP Models Comparison

Model | Year | Architecture | Pretraining Task | Best For
Word2Vec | 2013 | Shallow NN | CBOW/Skip-gram | Static embeddings
LSTM | 1997/2013 | Recurrent | - | Sequential modeling
Transformer | 2017 | Self-attention | Translation | Parallel sequence processing
BERT | 2018 | Encoder (Transformer) | Masked LM + NSP | Understanding tasks
GPT-3 | 2020 | Decoder (Transformer) | Autoregressive LM | Few-shot generation
T5 | 2019 | Encoder-Decoder | Span corruption | Text-to-text

NLP Pitfalls & Debugging

⚠️ Overfitting small data: Use pretrained models, not from scratch. Freeze embeddings, gradual unfreezing.
⚠️ OOV tokens: Use subword tokenization (BPE, WordPiece). Avoid word-level with small vocab.
✅ Max sequence length: Transformer O(n²). Use efficient attention or truncation. Longformer, BigBird for long docs.
✅ Learning rate: Transformers: use warmup (5-10% steps), AdamW, linear decay. BERT: 2e-5 to 5e-5.
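The warmup-then-linear-decay schedule recommended above can be sketched with `LambdaLR` (step counts are illustrative; `transformers.get_linear_schedule_with_warmup` provides the same shape ready-made):

```python
import torch

model = torch.nn.Linear(10, 2)                       # stand-in for a Transformer
opt = torch.optim.AdamW(model.parameters(), lr=2e-5) # BERT-range peak LR

total_steps, warmup_steps = 1000, 100                # warmup = 10% of steps

def lr_lambda(step):
    """Multiplier on the base LR: linear warmup 0 -> 1, then linear decay -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# call sched.step() after each opt.step()
```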

The Future of NLP

Efficiency

Pruning, quantization, distillation. Smaller models (DistilBERT, TinyBERT).

Multimodal

CLIP, Flamingo, GPT-4V. Text + images + audio.

Agents

Tool use, reasoning, planning (AutoGPT).

Alignment

RLHF, Constitutional AI. Safety.