Natural Language Processing Tutorial
Master NLP from the fundamentals of text processing to advanced transformers like BERT and GPT, with practical implementations in NLTK, spaCy, and Hugging Face.
- Text Preprocessing: tokenization, stemming
- Word Embeddings: Word2Vec, GloVe
- Transformers: BERT, GPT, T5
- Libraries: NLTK, spaCy, Hugging Face
Introduction to Natural Language Processing
Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and human language. It enables machines to understand, interpret, and generate human language in a valuable way.
Evolution of NLP
- 1950s: Georgetown-IBM experiment (machine translation)
- 1960s: ELIZA (first chatbot)
- 1990s: Statistical NLP approaches
- 2013: Word2Vec (Mikolov et al.)
- 2017: Transformer architecture (Vaswani et al.)
- 2018: BERT (Google), GPT (OpenAI)
- 2020+: Large Language Models (GPT-3, ChatGPT)
Why NLP Matters
- An estimated 80–90% of the world's data is unstructured text
- Automate customer service (chatbots)
- Extract insights from documents
- Enable search and information retrieval
- Break language barriers (translation)
- Power modern AI assistants
NLP Pipeline
Raw Text → Preprocessing → Feature Extraction → Modeling → Evaluation → Deployment
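The stages above can be treated as composable functions. This is a minimal illustration of that idea — the function names and the toy "model" are invented for this sketch, not a standard API:

```python
import re

# Each stage is a small function; a real pipeline would swap in richer
# implementations (e.g. spaCy for preprocessing, a trained classifier).
def preprocess(text):
    # Lowercase and strip everything except letters and spaces
    return re.sub(r'[^a-z\s]', '', text.lower())

def extract_features(text):
    # Toy feature extraction: word counts (a bag of words)
    counts = {}
    for token in text.split():
        counts[token] = counts.get(token, 0) + 1
    return counts

def model(features):
    # Toy "model": label a document positive if it mentions 'love'
    return 'positive' if features.get('love', 0) > 0 else 'neutral'

raw = "I LOVE Natural Language Processing!"
prediction = model(extract_features(preprocess(raw)))
print(prediction)  # -> positive
```

Evaluation and deployment then wrap this chain: score predictions against labeled data, and serve the composed function behind an API.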
Text Preprocessing
Text preprocessing is the crucial first step in any NLP pipeline. It involves cleaning and preparing raw text for analysis.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

def preprocess_text(text):
    original = text
    # Convert to lowercase
    text = text.lower()
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization (simple whitespace split)
    tokens = text.split()
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(word) for word in tokens]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(word) for word in tokens]
    return {
        'original': original,
        'tokens': tokens,
        'stemmed': stemmed,
        'lemmatized': lemmatized
    }

# Example usage
text = "The cats are running quickly in the gardens!"
result = preprocess_text(text)
print(result)
Text Representation
Converting text into numerical representations that machine learning models can understand.
Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'I love NLP',
'NLP is amazing',
'I love learning NLP'
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'I love NLP',
'NLP is amazing',
'I love learning NLP'
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
Word Embeddings
Word embeddings are dense vector representations that capture semantic meaning and relationships between words.
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# Sample sentences
sentences = [
    "I love natural language processing",
    "Word embeddings capture semantic meaning",
    "NLP is fascinating and useful",
    "Deep learning powers modern NLP"
]

# Tokenize sentences
tokenized_sentences = [word_tokenize(sent.lower()) for sent in sentences]

# Train Word2Vec model (a tiny corpus like this yields illustrative, not meaningful, vectors)
model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,
    window=5,
    min_count=1,
    workers=4
)

# Find similar words
similar_words = model.wv.most_similar('nlp')
print("Words similar to 'nlp':")
for word, score in similar_words:
    print(f"{word}: {score:.3f}")

# Get word vector
vector = model.wv['language']
print(f"\nVector for 'language': {vector[:5]}...")
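The scores that most_similar ranks by are cosine similarities, which are easy to compute directly. This standalone sketch uses made-up 3-dimensional vectors rather than trained embeddings, purely to show the arithmetic:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors: 'king' and 'queen' point in similar directions,
# 'banana' does not (all values invented for illustration)
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.75, 0.2])
banana = np.array([0.1, 0.05, 0.95])

print(round(cosine_similarity(king, queen), 3))   # -> 0.996
print(round(cosine_similarity(king, banana), 3))  # -> 0.195
```

Because cosine similarity ignores vector length, it measures direction — i.e. what the words are about — rather than how frequent they are.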
Classical NLP Models
Part-of-Speech Tagging with NLTK
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "The cat sat on the mat"
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
# Output: [('The', 'DT'), ('cat', 'NN'),
# ('sat', 'VBD'), ('on', 'IN'),
# ('the', 'DT'), ('mat', 'NN')]
Named Entity Recognition
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. is planning to open a new store in New York next year."
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Output:
# Apple Inc.: ORG
# New York: GPE
Transformers & Attention
The Transformer architecture revolutionized NLP with its self-attention mechanism.
┌─────────────────┐
│ Input Text │
└────────┬────────┘
▼
┌─────────────────┐
│ Tokenization │
└────────┬────────┘
▼
┌─────────────────┐
│ Embeddings │
└────────┬────────┘
▼
┌─────────────────┐
│ Positional │
│ Encoding │
└────────┬────────┘
▼
┌─────────────────┐
│ Multi-Head │
│ Self-Attention │
└────────┬────────┘
▼
┌─────────────────┐
│ Feed-Forward │
└────────┬────────┘
▼
┌─────────────────┐
│ Output │
└─────────────────┘
Transformer Encoder Block Architecture
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """
    Attention(Q,K,V) = softmax(QK^T/√d_k)V
    """
    d_k = K.shape[-1]
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    # Numerically stable softmax over the last axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    output = np.matmul(attention_weights, V)
    return output, attention_weights

# Example
seq_len, d_k = 3, 4
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights:\n", weights.round(2))
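The diagram above shows a Multi-Head Self-Attention block, while the scaled_dot_product_attention example is single-head. Multiple heads split the model dimension so each head can attend to different relationships. A sketch of that extension — the projection matrices here are random stand-ins for what a real model learns:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    # X: (seq_len, d_model); d_model must divide evenly across heads
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Random projections stand in for learned weights W_Q, W_K, W_V, W_O
    W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)        # this head's slice of dims
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])        # (seq_len, d_head)
    # Concatenate heads and apply the output projection
    return np.concatenate(heads, axis=-1) @ W_o        # (seq_len, d_model)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))                        # 3 tokens, d_model=8
out = multi_head_attention(X, num_heads=2, rng=rng)
print(out.shape)  # -> (3, 8)
```

Each head sees only d_model / num_heads dimensions, so the total cost is roughly the same as one full-width head.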
BERT: Bidirectional Encoder Representations from Transformers
BERT revolutionized NLP by pretraining on massive text corpora using masked language modeling.
from transformers import BertTokenizer, BertModel
import torch
# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Tokenize input
text = "Natural Language Processing is fascinating!"
inputs = tokenizer(text, return_tensors='pt',
                   padding=True, truncation=True)

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
# Last hidden states
last_hidden_states = outputs.last_hidden_state
# Pooled output (CLS token)
pooled_output = outputs.pooler_output
print(f"Sequence length: {last_hidden_states.shape[1]}")
print(f"Hidden size: {last_hidden_states.shape[2]}")
print(f"Pooled output shape: {pooled_output.shape}")
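The masked-language-modeling pretraining mentioned above can be illustrated without a model. BERT masks about 15% of input tokens; of those, 80% become [MASK], 10% become a random token, and 10% stay unchanged (but must still be predicted). A minimal sketch of that corruption step — the vocabulary and sentence are invented for the example:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """BERT-style masking; returns (corrupted_tokens, labels).

    labels[i] holds the original token where a prediction is required, else None.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # model must predict the original
            r = rng.random()
            if r < 0.8:
                corrupted.append('[MASK]')           # 80%: replace with mask token
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random replacement
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

vocab = ['language', 'processing', 'is', 'fascinating', 'natural']
tokens = 'natural language processing is fascinating'.split()
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=0)
print(corrupted)
print(labels)
```

Training then minimizes cross-entropy between the model's predictions at the corrupted positions and the labels — the model never sees which unchanged tokens were "selected", which forces it to build context-aware representations of every position.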
GPT: Generative Pre-trained Transformers
GPT models are autoregressive language models that excel at text generation tasks.
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Load pre-trained GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Encode input
input_text = "Natural Language Processing is"
inputs = tokenizer.encode(input_text, return_tensors='pt')

# Generate text
outputs = model.generate(
    inputs,
    max_length=50,
    num_return_sequences=1,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token by default
)
# Decode and print
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
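Under the hood, generate() runs an autoregressive loop: predict a distribution over the next token, pick one, append it, and repeat. A toy illustration, with a hand-written bigram lookup table standing in for the transformer (the table and tokens are invented):

```python
# Toy next-token "model": a bigram lookup table instead of a transformer
bigram_model = {
    'natural': 'language',
    'language': 'processing',
    'processing': 'is',
    'is': 'fascinating',
}

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Condition on the last token only (a real LM conditions on the full prefix)
        next_token = bigram_model.get(tokens[-1])
        if next_token is None:  # no known continuation: stop, like an EOS token
            break
        tokens.append(next_token)
    return ' '.join(tokens)

print(generate(['natural']))  # -> natural language processing is fascinating
```

Sampling parameters such as temperature come in when the model outputs a probability distribution rather than a single lookup: the distribution is reshaped before a token is drawn from it.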
NLP Applications
Sentiment Analysis
from transformers import pipeline
sentiment = pipeline('sentiment-analysis')
result = sentiment("I love this tutorial!")
print(result)
Translation
translator = pipeline('translation_en_to_fr')
result = translator("Hello, how are you?")
print(result)
Question Answering
qa = pipeline('question-answering')
result = qa({
    'question': 'What is NLP?',
    'context': 'NLP stands for Natural Language Processing.'
})
print(result)
NLP Tools & Libraries
- NLTK: classic NLP library
- spaCy: industrial-strength NLP
- Hugging Face: Transformers library
- Gensim: topic modeling
Why Master NLP?
- High demand in industry (AI engineers, data scientists)
- Power modern applications (chatbots, search, assistants)
- Foundation for Large Language Models (LLMs)
- Cross-domain applications (healthcare, finance, legal)
- Rapidly evolving field with cutting-edge research
- Combine linguistics, ML, and deep learning