NLP Tutorial

NLTK, spaCy & Gensim

Classic NLP libraries for preprocessing, pipelines, topic modeling, and embeddings.

NLTK (Natural Language Toolkit)

NLTK: The Grandfather of Python NLP

The Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data. Originally released in 2001 by Steven Bird and Edward Loper, it provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet.

While modern production environments lean towards spaCy or Transformers for raw speed and deep learning integration, NLTK remains a popular choice for teaching and learning the foundational algorithms of computational linguistics.

Level 1 — Corpus Access and Basic Processing

NLTK's biggest strength is its built-in access to massive amounts of text data (corpora) that you can use to train and test your algorithms instantly.

Accessing Corpora & Basic Tokenization

import nltk

# 1. Download necessary datasets (Run once)
# nltk.download('gutenberg')
# nltk.download('punkt')
# nltk.download('punkt_tab')  # NLTK 3.9+ loads tokenizer data from 'punkt_tab' instead of 'punkt'
# nltk.download('stopwords')

# 2. Accessing built-in text (Jane Austen's Emma)
from nltk.corpus import gutenberg
emma = gutenberg.raw('austen-emma.txt')
print(f"Total characters in Emma: {len(emma)}")

# 3. Sentence and Word Tokenization
from nltk.tokenize import sent_tokenize, word_tokenize

sample_text = "Hello there! How are you doing today? I hope you are learning NLP."

# Split into sentences
sentences = sent_tokenize(sample_text)
print("Sentences:", sentences)

# Split into words
words = word_tokenize(sample_text)
print("Words:", words)

# 4. Removing Stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

filtered_words = [word for word in words if word.lower() not in stop_words and word.isalnum()]
print("Filtered Words:", filtered_words)

Level 2 — Part-of-Speech Tagging and Chunking

Beyond simple tokenization, NLTK allows you to analyze the grammatical structure of sentences. This involves tagging words with their Part-of-Speech (POS) and grouping them into meaningful "Chunks" (like Noun Phrases).

Common POS Tags (Penn Treebank)

NN: Noun, singular or mass
VB: Verb, base form
JJ: Adjective
RB: Adverb
IN: Preposition or subordinating conjunction

POS Tagging & Regex Chunking

import nltk
# nltk.download('averaged_perceptron_tagger')
# nltk.download('averaged_perceptron_tagger_eng')  # resource name used by NLTK 3.9+

sentence = "The little yellow dog barked at the angry cat."
tokens = nltk.word_tokenize(sentence)

# 1. Part of Speech Tagging
tagged_words = nltk.pos_tag(tokens)
print("Tagged Words:", tagged_words)
# [('The', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('angry', 'JJ'), ('cat', 'NN'), ('.', '.')]

# 2. Chunking (Extracting Noun Phrases)
# Define a grammar rule using Regular Expressions: 
# "An optional Determiner, followed by zero or more Adjectives, followed by a Noun"
grammar = "NP: {<DT>?<JJ>*<NN>}"

chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tagged_words)

# Print chunks that match our NP rule
for subtree in tree.subtrees():
    if subtree.label() == 'NP':
        print("Noun Phrase Found:", subtree.leaves())
        
# Noun Phrase Found: [('The', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')]
# Noun Phrase Found: [('the', 'DT'), ('angry', 'JJ'), ('cat', 'NN')]
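
To see what the chunk grammar is doing conceptually, here is a minimal sketch that applies the same rule with Python's built-in re module. It is not how RegexpParser is implemented; the helper name chunk_noun_phrases is hypothetical, and the tagged tokens are copied by hand from the example above.

```python
import re

# Tagged tokens copied from the example above (no NLTK needed for this sketch).
tagged = [("The", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("angry", "JJ"),
          ("cat", "NN"), (".", ".")]


def chunk_noun_phrases(tagged_tokens):
    """Return token spans whose tags match <DT>?<JJ>*<NN> (hypothetical helper)."""
    # Encode the tag sequence as one string such as "<DT><JJ><NN>" and record
    # which token index starts at each character offset.
    offsets = {}
    tag_string = ""
    for index, (_, tag) in enumerate(tagged_tokens):
        offsets[len(tag_string)] = index
        tag_string += f"<{tag}>"
    offsets[len(tag_string)] = len(tagged_tokens)  # sentinel for the end

    # Same rule as the grammar: optional determiner, any adjectives, one noun.
    rule = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    return [tagged_tokens[offsets[m.start()]:offsets[m.end()]]
            for m in rule.finditer(tag_string)]


noun_phrases = chunk_noun_phrases(tagged)
for phrase in noun_phrases:
    print("Noun Phrase Found:", phrase)
```

Because the rule asks for the exact tag <NN>, plural nouns (NNS) or proper nouns (NNP) would not match; widening the pattern is the usual next step.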

Level 3 — WordNet and Semantic Similarity

NLTK includes an interface to WordNet, a massive lexical database of English where nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets).

Exploring Synonyms and Path Similarity

from nltk.corpus import wordnet as wn
# nltk.download('wordnet')

# 1. Get Synsets (Meanings) for a word
synsets = wn.synsets('bank')
for syn in synsets:
    print(f"Meaning: {syn.name()} - {syn.definition()}")
    
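# Meaning: bank.n.01 - sloping land (especially the slope beside a body of water)
# Meaning: depository_financial_institution.n.01 - a financial institution that accepts deposits and channels the money into lending activities
# ... (further senses of 'bank' follow)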

# 2. Get Synonyms (Lemmas)
good_synonyms = []
for syn in wn.synsets("good"):
    for lemma in syn.lemmas():
        good_synonyms.append(lemma.name())
print("Synonyms for 'good':", set(good_synonyms))

# 3. Calculate Semantic Similarity (path_similarity = 1 / (1 + length of the shortest hypernym path between the two synsets))
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
car = wn.synset('car.n.01')

print(f"Similarity (Dog / Cat): {dog.path_similarity(cat):.2f}") # ~0.20 (Closely related animals)
print(f"Similarity (Dog / Car): {dog.path_similarity(car):.2f}") # much lower (distantly related concepts)
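
The path-similarity score is simple enough to compute by hand. The sketch below uses a tiny, hypothetical taxonomy (TOY_HYPERNYMS is invented for illustration and is far smaller than WordNet's real hierarchy) to show the formula 1 / (1 + shortest path length).

```python
# Hypothetical toy taxonomy: each concept points to its single parent (hypernym).
TOY_HYPERNYMS = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "animal": "entity",
    "car": "vehicle",
    "vehicle": "artifact",
    "artifact": "entity",
}


def ancestors_with_depth(concept):
    """Map the concept and each ancestor to the number of edges walked upward."""
    depths = {}
    steps = 0
    while concept is not None:
        depths[concept] = steps
        concept = TOY_HYPERNYMS.get(concept)
        steps += 1
    return depths


def toy_path_similarity(a, b):
    """1 / (1 + shortest path between a and b through a shared ancestor)."""
    up_a = ancestors_with_depth(a)
    up_b = ancestors_with_depth(b)
    shared = set(up_a) & set(up_b)
    if not shared:
        return None
    shortest = min(up_a[c] + up_b[c] for c in shared)
    return 1 / (1 + shortest)


print(toy_path_similarity("dog", "cat"))  # 4 edges apart in this toy graph
print(toy_path_similarity("dog", "car"))  # 7 edges apart in this toy graph
```

In this toy graph dog and cat meet at "carnivore" after two steps each, so the score is 1 / (1 + 4). Real WordNet synsets can have several hypernyms, which is why the library searches for the shortest of many possible paths.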

NLTK vs spaCy: Use NLTK when you want to explore how NLP algorithms work internally, test multiple different stemming algorithms (Porter vs Snowball), or access built-in classic datasets. Use spaCy when you need to process 10 million tweets as fast as possible in a production pipeline.
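
As a small illustration of the stemmer comparison mentioned above, the following sketch runs the same words through NLTK's PorterStemmer and SnowballStemmer. The word list is arbitrary, and neither stemmer needs any nltk.download() data.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Arbitrary sample words; the two algorithms agree on some and differ on others.
sample_words = ["running", "generously", "fairly", "studies"]

for word in sample_words:
    print(f"{word:<12} porter={porter.stem(word):<12} snowball={snowball.stem(word)}")
```

Printing the stems side by side makes it easy to spot where the two algorithms diverge before choosing one for a project.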

spaCy

spaCy: Industrial-Strength NLP

Unlike NLTK, which was built for teaching, spaCy was built specifically for production software. Its performance-critical core is written in Cython for speed and tight memory management, and it provides a single, well-optimized implementation for each task rather than a menu of algorithms.

spaCy excels at large-scale information extraction tasks, providing pre-trained statistical neural network models for over 23 languages.

Level 1 — The Object-Oriented Pipeline

In spaCy, you load a trained language model, which returns a processing pipeline (conventionally named nlp). When you pass text through this pipeline, spaCy tokenizes, POS-tags, lemmatizes, and dependency-parses it in a single call, returning a rich Doc object.
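
To make the idea of a pipeline concrete without downloading a model, here is a minimal sketch that builds a blank English pipeline, adds one rule-based component, and processes a batch of texts with nlp.pipe. The variable name blank_nlp and the sample texts are illustrative choices; a blank pipeline has only a tokenizer, so it provides no tags, lemmas, or entities.

```python
import spacy

# A blank pipeline contains only the tokenizer; no trained model is required.
blank_nlp = spacy.blank("en")

# Add spaCy's rule-based sentence splitter as the only pipeline component.
blank_nlp.add_pipe("sentencizer")
print("Components:", blank_nlp.pipe_names)

texts = ["Hello there! How are you doing today?", "I hope you are learning NLP."]

# nlp.pipe processes texts as a stream, which is the usual way to handle large batches.
docs = list(blank_nlp.pipe(texts))
sentence_counts = [len(list(doc.sents)) for doc in docs]
print("Sentences per text:", sentence_counts)
```

A trained pipeline such as en_core_web_sm works the same way, just with more components listed in pipe_names.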

Downloading Models

Before running spaCy scripts, you must download a pre-trained model for your language via the terminal:

python -m spacy download en_core_web_sm    # Small English model, ~12 MB
python -m spacy download en_core_web_trf   # Transformer-based English model, ~400 MB

Processing Text into a Doc Object

import spacy
import pandas as pd

# Load the small English model pipeline
nlp = spacy.load("en_core_web_sm")

text = "Apple is looking at buying U.K. startup for $1 billion."
# Process the text - this instantly runs Tokenizer, Tagger, Parser, NER, and Lemmatizer!
doc = nlp(text)

# We can easily extract the rich linguistic data into a DataFrame for viewing
token_data = []
for token in doc:
    token_data.append([
        token.text,        # The exact text
        token.lemma_,      # The root form (buying -> buy)
        token.pos_,        # Part of speech (VERB, PROPN)
        token.dep_,        # Syntactic dependency (Subject, Object)
        token.is_stop,     # Is it a stopword? (True/False)
        token.is_alpha     # Is it alphabetical? (True/False)
    ])

df = pd.DataFrame(token_data, columns=['Text', 'Lemma', 'POS', 'Dependency', 'Is Stop', 'Is Alpha'])
print(df.head(6))

Level 2 — Named Entity Recognition (NER)

One of spaCy's strongest out-of-the-box features is its Named Entity Recognizer, which identifies people, places, organizations, monetary amounts, dates, and more.

Extracting Entities

text2 = "On July 20, 1969, Neil Armstrong walked on the Moon. NASA spent roughly $25 billion on the Apollo program."
doc2 = nlp(text2)

print(f"{'Entity Text':<20} | {'Label':<10} | {'Explanation'}")
print("-" * 60)
for ent in doc2.ents:
    # ent.label_ gives the code (e.g. 'GPE'), spacy.explain() translates it to human text
    print(f"{ent.text:<20} | {ent.label_:<10} | {spacy.explain(ent.label_)}")

# Example output (exact entities and labels depend on the model version):
# July 20, 1969        | DATE       | Absolute or relative dates or periods
# Neil Armstrong       | PERSON     | People, including fictional
# the Moon             | LOC        | Non-GPE locations, mountain ranges, bodies of water
# NASA                 | ORG        | Companies, agencies, institutions, etc.
# roughly $25 billion  | MONEY      | Monetary values, including unit

Level 3 — Custom Rule-Based Matching

While Regex operates on raw strings, spaCy's Matcher operates on Doc objects and token attributes. This allows you to write powerful rules based on grammar rather than just character sequences.

Grammar-Aware Pattern Matching

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

# Let's find patterns where a verb is immediately followed by any pronoun
# e.g., "loved it", "hated him", "saw them"
pattern = [
    {"POS": "VERB"},             # Match any verb
    {"POS": "PRON"}              # Match any pronoun immediately after
]

matcher.add("VERB_PRON_PATTERN", [pattern])

doc3 = nlp("I loved it! But my brother absolutely hated him for what he did.")
matches = matcher(doc3)

for match_id, start, end in matches:
    matched_span = doc3[start:end]
    print(f"Found match: '{matched_span.text}'")
    
# Found match: 'loved it'
# Found match: 'hated him'
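
The core idea behind token-based matching can be sketched in plain Python. The function below is a simplified illustration, not spaCy's actual Matcher algorithm: match_token_pattern and the hand-tagged toy_tokens are hypothetical, and it supports only exact attribute equality (no operators such as "OP": "+").

```python
def match_token_pattern(tokens, pattern):
    """Return (start, end) spans where consecutive tokens satisfy each pattern dict."""
    spans = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        # Every attribute named in a pattern dict must equal the token's attribute.
        if all(all(token.get(key) == value for key, value in spec.items())
               for token, spec in zip(window, pattern)):
            spans.append((start, start + width))
    return spans


# Hand-tagged tokens standing in for a processed Doc (assumed tags, for illustration).
toy_tokens = [
    {"TEXT": "I", "POS": "PRON"},
    {"TEXT": "loved", "POS": "VERB"},
    {"TEXT": "it", "POS": "PRON"},
    {"TEXT": "!", "POS": "PUNCT"},
]
toy_pattern = [{"POS": "VERB"}, {"POS": "PRON"}]

toy_matches = match_token_pattern(toy_tokens, toy_pattern)
for start, end in toy_matches:
    print("Found match:", " ".join(t["TEXT"] for t in toy_tokens[start:end]))
```

Because the comparison runs on token attributes rather than characters, the same pattern matches any verb-pronoun pair regardless of spelling.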

displaCy Visualization: spaCy includes a built-in visualizer, spacy.displacy, that renders syntactic dependency trees and highlighted named entities as HTML, either in your browser or directly in a Jupyter Notebook.
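
Here is a minimal sketch of rendering entity highlights to an HTML string. To keep it independent of any downloaded model, it uses a blank pipeline and sets one entity by hand; with a trained model you would pass the Doc returned by nlp(text) instead. The names blank_nlp, ent_doc, and ent_html are illustrative.

```python
import spacy
from spacy import displacy

# Blank pipeline: tokenizer only, so the entity below is annotated manually.
blank_nlp = spacy.blank("en")
ent_doc = blank_nlp("NASA spent roughly $25 billion on the Apollo program.")

# Mark characters 0-4 ("NASA") as an organization.
org_span = ent_doc.char_span(0, 4, label="ORG")
ent_doc.ents = [org_span]

# jupyter=False makes render() return the markup as a string instead of displaying it.
ent_html = displacy.render(ent_doc, style="ent", jupyter=False)
print(ent_html[:200])
```

The returned string can be written to an .html file and opened in a browser; style="dep" draws a dependency tree instead, but it needs a pipeline with a parser.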

Gensim

Gensim: Topic Modeling for Humans

Gensim (short for "Generate Similar") is designed to process extremely large text collections using data-streaming algorithms, so the entire dataset never has to be loaded into RAM. It is one of the best-known Python libraries for topic modeling (Latent Dirichlet Allocation) and static word embeddings (Word2Vec).
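
The streaming idea can be sketched with a small re-iterable class that reads one line at a time. The class name LineSentences, the whitespace tokenization, and the temporary file are assumptions made for illustration. Gensim's Word2Vec accepts any restartable iterable of token lists as its sentences argument, and Gensim also ships a similar ready-made helper, gensim.models.word2vec.LineSentence.

```python
import os
import tempfile


class LineSentences:
    """Re-iterable stream of tokenized lines; only one line is held in memory at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-opening the file on each iteration lets a model make several passes.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Write a tiny stand-in corpus to disk (a real corpus would be far larger).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("The patient presented with cardiac hypertrophy\n\nPrescribed beta blockers\n")
    corpus_path = tmp.name

stream = LineSentences(corpus_path)
first_pass = list(stream)
second_pass = list(stream)  # the stream can be restarted, unlike a plain generator
os.remove(corpus_path)

print(first_pass)
```

A plain generator would be exhausted after one pass, which is why a class with __iter__ is the usual choice for multi-epoch training.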

Level 1 — Training Your Own Word2Vec Model

Pre-trained word embeddings from Google are great, but what if your company works in a heavily specialized medical or legal field? General embeddings won't understand your domain-specific jargon! Gensim allows you to easily train an embedding model strictly on your own documents.

Training Word2Vec from Scratch

from gensim.models import Word2Vec

# Step 1: Prepare your data as a list of lists of words (tokenized sentences)
# In reality, this would be a massive stream of millions of sentences from a medical journal.
corpus_sentences = [
    ['the', 'patient', 'presented', 'with', 'severe', 'cardiac', 'hypertrophy'],
    ['echocardiogram', 'showed', 'thickening', 'of', 'ventricular', 'walls'],
    ['prescribed', 'beta', 'blockers', 'to', 'reduce', 'cardiac', 'strain'],
    ['patient', 'responded', 'well', 'to', 'the', 'prescribed', 'medication']
]

# Step 2: Train the Model
# vector_size = dimensionality of each word vector (usually 100-300)
# window = maximum distance between the target word and the context words considered
# min_count = ignores words that appear fewer than this many times
model = Word2Vec(sentences=corpus_sentences, vector_size=50, window=3, min_count=1, workers=4)

# Step 3: Extract the Mathematical Vectors
vector_for_cardiac = model.wv['cardiac']
print(f"Vector (first 5 dims): {vector_for_cardiac[:5]}")

# Step 4: Find Semantic Similarities within your Custom Domain!
similar_words = model.wv.most_similar('cardiac', topn=2)
print("Most similar to 'cardiac':", similar_words)
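# e.g., Output: [('hypertrophy', 0.1554), ('strain', 0.1203)]
# Note: with only four sentences these scores are essentially noise and change between runs;
# meaningful neighbors require a much larger domain corpus.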
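
The scores returned by most_similar are cosine similarities between word vectors. The following sketch computes the same measure by hand on made-up three-dimensional vectors (toy_vectors and rank_most_similar are hypothetical and unrelated to any trained model).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration; real Word2Vec vectors have 100-300 dimensions.
toy_vectors = {
    "cardiac": [0.9, 0.1, 0.0],
    "ventricular": [0.8, 0.2, 0.1],
    "medication": [0.1, 0.9, 0.3],
}


def rank_most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = rank_most_similar("cardiac", toy_vectors)
print(ranked)
```

Cosine similarity ignores vector length and compares direction only, so frequent and rare words can still be compared fairly.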

Level 2 — Topic Modeling with LDA

Latent Dirichlet Allocation (LDA) automatically groups thousands of articles into different "topics" by statistically analyzing which words frequently co-occur in the same documents.

Topic Discovery using LDA

from gensim import corpora, models

# Suppose we have three documents: one about tech, one about finance, one about sports.
doc_texts = [
    ["apple", "iphone", "release", "software", "update", "battery"],
    ["federal", "reserve", "interest", "rate", "inflation", "market", "economy"],
    ["team", "coach", "football", "goal", "championship", "stadium"]
]

# 1. Create a dictionary representation of the documents (maps words to integer IDs)
dictionary = corpora.Dictionary(doc_texts)

# 2. Convert each document to a Bag-of-Words vector using the dictionary (list of (token_id, token_count) tuples)
corpus = [dictionary.doc2bow(text) for text in doc_texts]

# 3. Train the LDA Model
# We ask it to find 3 distinct abstract topics hidden inside the corpus
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10, random_state=42)

# 4. View the topics it discovered!
print("The 3 discovered topics are \n" + "-" * 30)
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic: {idx} \nWords: {topic}\n")
    
# e.g., Topic: 0 -> 0.143*"federal" + 0.143*"reserve" + 0.143*"interest" + 0.143*"rate" + etc. (exact weights vary)
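
The bag-of-words step is easy to reproduce by hand, which helps when inspecting what LDA receives as input. This is a minimal sketch: build_vocab and doc_to_bow are hypothetical helpers, and the integer IDs they assign will not necessarily match the ones Gensim's Dictionary produces.

```python
from collections import Counter


def build_vocab(docs):
    """Assign each distinct token an integer ID in order of first appearance."""
    token_to_id = {}
    for doc in docs:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def doc_to_bow(doc, token_to_id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token_to_id[token] for token in doc if token in token_to_id)
    return sorted(counts.items())


toy_docs = [
    ["rate", "inflation", "rate"],
    ["goal", "rate"],
]
vocab = build_vocab(toy_docs)
bow_corpus = [doc_to_bow(doc, vocab) for doc in toy_docs]

print(vocab)
print(bow_corpus)
```

Word order is discarded entirely; LDA only ever sees which token IDs occur in a document and how often.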

pyLDAvis Integration: Gensim models can be visualized with pyLDAvis, a library that generates an interactive, D3.js-powered bubble chart of the discovered topics, letting you explore topic sizes and overlaps in a web browser or Jupyter Notebook.

TextBlob

TextBlob is a Python library aimed at beginners. It wraps NLTK and the Pattern library behind a simple API: instead of making you assemble pipelines, parse lists, or instantiate machine learning classes, a TextBlob object behaves much like a standard Python string with NLP methods attached.

Level 1 — The Ultimate Shortcut

TextBlob One-Liners

from textblob import TextBlob

blob = TextBlob("I havv a terrrible headach and I feell misrable.")

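# 1. Instant Spell Checking (based on Peter Norvig's spelling corrector, as implemented in Pattern)
print("Corrected:", blob.correct())
# Ideal result: I have a terrible headache and I feel miserable.
# (The corrector is only about 70% accurate, so some words may come back wrong or unchanged.)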

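# 2. Translation (legacy: called the Google Translate web API, so it required internet access)
# translate() was deprecated in TextBlob 0.16 and removed in 0.18, so guard the call.
if hasattr(blob, "translate"):
    french_blob = blob.translate(from_lang='en', to='fr')
    print("French:", french_blob)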

# 3. Instant Pluralization and Singularization
word1 = TextBlob("octopus")
print("Plural:", word1.words[0].pluralize()) # octopodes

word2 = TextBlob("geese")
print("Singular:", word2.words[0].singularize()) # goose

# 4. N-Grams natively
print("Trigrams:", blob.ngrams(n=3))
# [WordList(['I', 'havv', 'a']), WordList(['havv', 'a', 'terrrible']), ...]
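
Under the hood, n-gram extraction is just a sliding window over the token list. Here is a minimal plain-Python sketch of the idea; sliding_ngrams is a hypothetical helper, and it uses simple whitespace tokenization rather than TextBlob's tokenizer.

```python
def sliding_ngrams(tokens, n=3):
    """Return every run of n consecutive tokens, in order."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


tokens = "I havv a terrrible headach".split()
trigrams = sliding_ngrams(tokens, n=3)
print(trigrams)
```

A sequence shorter than n yields an empty list, since no full window fits.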