Natural Language Processing Complete Tutorial
Beginner to Advanced Transformers & LLMs

Natural Language Processing Tutorial

Master NLP, from the fundamentals of text processing to advanced transformer models such as BERT and GPT, with practical implementations in NLTK, spaCy, and Hugging Face.

Text Preprocessing

Tokenization, stemming

Word Embeddings

Word2Vec, GloVe

Transformers

BERT, GPT, T5

Libraries

NLTK, spaCy, HF

Introduction to Natural Language Processing

Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and human language. It enables machines to understand, interpret, and generate human language in a valuable way.

Evolution of NLP
  • 1950s: Georgetown-IBM experiment (machine translation)
  • 1960s: ELIZA (first chatbot)
  • 1990s: Statistical NLP approaches
  • 2013: Word2Vec (Mikolov et al.)
  • 2017: Transformer architecture (Vaswani et al.)
  • 2018: BERT (Google), GPT (OpenAI)
  • 2020+: Large Language Models (GPT-3, ChatGPT)
Why NLP Matters
  • An estimated 80–90% of the world's data is unstructured text
  • Automate customer service (chatbots)
  • Extract insights from documents
  • Enable search and information retrieval
  • Break language barriers (translation)
  • Power modern AI assistants

NLP Pipeline

Raw Text → Preprocessing → Feature Extraction → Modeling → Evaluation → Deployment
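The stages above can be sketched as plain functions composed in order. This is a minimal illustration, not a production pipeline; the function names and the toy "model" are invented for this example:

```python
import re

def preprocess(text):
    # Lowercase, strip everything except letters and spaces, then split
    return re.sub(r'[^a-z\s]', '', text.lower()).split()

def extract_features(tokens):
    # Toy feature extraction: raw token counts
    features = {}
    for tok in tokens:
        features[tok] = features.get(tok, 0) + 1
    return features

def model(features):
    # Stand-in "model": classify the text as long/short by token count
    return 'long' if sum(features.values()) > 3 else 'short'

raw = "NLP pipelines turn raw text into predictions!"
prediction = model(extract_features(preprocess(raw)))
print(prediction)  # long
```

Each stage consumes the previous stage's output, which is exactly the shape real pipelines take, with each toy function replaced by a real component.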

Text Preprocessing

Text preprocessing is the crucial first step in any NLP pipeline. It involves cleaning and preparing raw text for analysis.

Complete Text Preprocessing Pipeline
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # Keep only letters and whitespace
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Tokenization (simple whitespace split)
    tokens = text.split()
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    
    # Stemming: crude suffix stripping ("running" -> "run")
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(word) for word in tokens]
    
    # Lemmatization: dictionary-based normalization ("cats" -> "cat")
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word) for word in tokens]
    
    return {
        'cleaned': text,
        'tokens': tokens,
        'stemmed': stemmed,
        'lemmatized': lemmatized
    }

# Example usage
text = "The cats are running quickly in the gardens!"
result = preprocess_text(text)
print(result)

Text Representation

Converting text into numerical representations that machine learning models can understand.
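The core idea behind bag-of-words can be sketched with the standard library before reaching for scikit-learn (a minimal illustration):

```python
from collections import Counter

corpus = [
    'I love NLP',
    'NLP is amazing',
    'I love learning NLP'
]

# Build a shared vocabulary across all documents
vocab = sorted({word.lower() for doc in corpus for word in doc.split()})

# Each document becomes a count vector over that vocabulary
vectors = []
for doc in corpus:
    counts = Counter(word.lower() for word in doc.split())
    vectors.append([counts[word] for word in vocab])

print(vocab)    # ['amazing', 'i', 'is', 'learning', 'love', 'nlp']
print(vectors)  # [[0, 1, 0, 0, 1, 1], [1, 0, 1, 0, 0, 1], [0, 1, 0, 1, 1, 1]]
```

Note that scikit-learn's `CountVectorizer` below applies its own tokenizer, which drops single-character tokens like "i", so its vocabulary differs slightly from this hand-rolled version.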

Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love NLP',
    'NLP is amazing',
    'I love learning NLP'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())
TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'I love NLP',
    'NLP is amazing',
    'I love learning NLP'
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))

Word Embeddings

Word embeddings are dense vector representations that capture semantic meaning and relationships between words.

Word2Vec with Gensim
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# Sample sentences
sentences = [
    "I love natural language processing",
    "Word embeddings capture semantic meaning",
    "NLP is fascinating and useful",
    "Deep learning powers modern NLP"
]

# Tokenize sentences
tokenized_sentences = [word_tokenize(sent.lower()) for sent in sentences]

# Train Word2Vec model
model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,
    window=5,
    min_count=1,
    workers=4
)

# Find similar words
similar_words = model.wv.most_similar('nlp')
print("Words similar to 'nlp':")
for word, score in similar_words:
    print(f"{word}: {score:.3f}")

# Get word vector
vector = model.wv['language']
print(f"\nVector for 'language': {vector[:5]}...")
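The similarity score that `most_similar` reports is cosine similarity between word vectors. A minimal numpy sketch on toy vectors (the 3-d values below are illustrative, not trained embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-d "embeddings" (illustrative values only)
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.8, 0.9, 0.2])
apple = np.array([0.1, 0.2, 0.9])

print(f"king vs queen: {cosine_similarity(king, queen):.3f}")  # high
print(f"king vs apple: {cosine_similarity(king, apple):.3f}")  # low
```

Because cosine similarity ignores vector magnitude, it measures direction only, which is why frequent and rare words with similar meanings can still score as close.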

Classical NLP Models

Part-of-Speech Tagging with NLTK
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "The cat sat on the mat"
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)

print(pos_tags)
# Output: [('The', 'DT'), ('cat', 'NN'), 
#          ('sat', 'VBD'), ('on', 'IN'), 
#          ('the', 'DT'), ('mat', 'NN')]
Named Entity Recognition
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. is planning to open a new store in New York next year."
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Output:
# Apple Inc.: ORG
# New York: GPE

Transformers & Attention

The Transformer architecture revolutionized NLP with its self-attention mechanism.

    ┌─────────────────┐
    │   Input Text    │
    └────────┬────────┘
             ▼
    ┌─────────────────┐
    │  Tokenization   │
    └────────┬────────┘
             ▼
    ┌─────────────────┐
    │   Embeddings    │
    └────────┬────────┘
             ▼
    ┌─────────────────┐
    │    Positional   │
    │    Encoding     │
    └────────┬────────┘
             ▼
    ┌─────────────────┐
    │  Multi-Head     │
    │  Self-Attention │
    └────────┬────────┘
             ▼
    ┌─────────────────┐
    │  Feed-Forward   │
    └────────┬────────┘
             ▼
    ┌─────────────────┐
    │    Output       │
    └─────────────────┘
                        

Transformer Encoder Block Architecture
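The positional-encoding stage in the diagram can be sketched with the sinusoidal scheme from the original Transformer paper: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal numpy version:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, np.newaxis]          # (seq_len, 1)
    i = np.arange(d_model)[np.newaxis, :]            # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / d_model)
    angles = pos * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dims: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dims: cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=8)
print(pe.shape)        # (10, 8)
print(pe[0].round(2))  # position 0: sine terms are 0, cosine terms are 1
```

These values are added to the token embeddings so that, unlike in a recurrent network, the model receives explicit information about token order.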

Self-Attention Mechanism
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """
    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    """
    d_k = K.shape[-1]
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    # Numerically stable softmax: subtract the row max before exponentiating
    scores -= scores.max(axis=-1, keepdims=True)
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    output = np.matmul(attention_weights, V)
    return output, attention_weights

# Example
seq_len, d_k = 3, 4
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights:\n", weights.round(2))
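Multi-head attention, shown in the diagram above, runs several attention computations in parallel on lower-dimensional projections of the input and concatenates the results. A minimal numpy sketch with untrained random projection matrices (real models learn these weights):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention with a numerically stable softmax
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, num_heads, rng):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head projects the input to d_head dimensions (random here, learned in practice)
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))
    # Concatenate the head outputs back to d_model dimensions
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))          # 3 tokens, d_model = 8
out = multi_head_attention(X, num_heads=2, rng=rng)
print(out.shape)  # (3, 8)
```

Splitting the model dimension across heads keeps the total cost comparable to single-head attention while letting each head attend to different relationships.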

BERT: Bidirectional Encoder Representations from Transformers

BERT revolutionized NLP by pretraining on massive text corpora using masked language modeling.

BERT with Hugging Face
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize input
text = "Natural Language Processing is fascinating!"
inputs = tokenizer(text, return_tensors='pt', 
                  padding=True, truncation=True)

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
    
# Last hidden states
last_hidden_states = outputs.last_hidden_state
# Pooled output (CLS token)
pooled_output = outputs.pooler_output

print(f"Sequence length: {last_hidden_states.shape[1]}")
print(f"Hidden size: {last_hidden_states.shape[2]}")
print(f"Pooled output shape: {pooled_output.shape}")
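The masked-language-modeling objective BERT is pretrained on can be sketched in plain Python: roughly 15% of tokens are selected as prediction targets, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged. This is a simplified illustration of the scheme; real BERT operates on subword token IDs, not words:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=42):
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked.append('[MASK]')           # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random token
            else:
                masked.append(tok)                # 10%: keep unchanged
        else:
            masked.append(tok)
            labels.append(None)  # not a prediction target
    return masked, labels

tokens = "the cat sat on the mat and looked at the dog".split()
vocab = ['fish', 'tree', 'car', 'sun']
masked, labels = mask_tokens(tokens, vocab)
print(masked)
print(labels)
```

Because the loss is computed only at the selected positions, the model must use both left and right context to recover each hidden token, which is what makes BERT bidirectional.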

GPT: Generative Pre-trained Transformers

GPT models are autoregressive language models that excel at text generation tasks.

Text Generation with GPT-2
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Encode input
input_text = "Natural Language Processing is"
inputs = tokenizer.encode(input_text, return_tensors='pt')

# Generate text
outputs = model.generate(
    inputs,
    max_length=50,
    num_return_sequences=1,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode and print
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
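The `temperature` parameter above rescales the model's logits before sampling: probs = softmax(logits / T). Lower values sharpen the distribution toward the most likely token; higher values flatten it, producing more varied text. A numpy sketch with made-up logits:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide logits by T, then apply a numerically stable softmax
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]  # illustrative values
for T in (0.5, 1.0, 2.0):
    print(f"T={T}: {softmax_with_temperature(logits, T).round(3)}")
```

At T = 0.5 the top token dominates; at T = 2.0 the three probabilities move much closer together, so sampling becomes less predictable.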

NLP Applications

Sentiment Analysis
from transformers import pipeline

sentiment = pipeline('sentiment-analysis')
result = sentiment("I love this tutorial!")
print(result)
Translation
translator = pipeline(
    'translation_en_to_fr'
)
result = translator(
    "Hello, how are you?"
)
print(result)
Question Answering
qa = pipeline('question-answering')
result = qa({
    'question': 'What is NLP?',
    'context': 'NLP stands for Natural Language Processing.'
})
print(result)

NLP Tools & Libraries

NLTK

Classic NLP library

spaCy

Industrial-strength

Hugging Face

Transformers library

Gensim

Topic modeling

Why Master NLP?

  • High demand in industry (AI engineers, data scientists)
  • Powers modern applications (chatbots, search, assistants)
  • Foundation for Large Language Models (LLMs)
  • Cross-domain applications (healthcare, finance, legal)
  • Rapidly evolving field with cutting-edge research
  • Combines linguistics, machine learning, and deep learning