NLTK, spaCy & Gensim
Classic NLP libraries for preprocessing, pipelines, topic modeling, and embeddings.
NLTK (Natural Language Toolkit)
NLTK: The Grandfather of Python NLP
The Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data. Originally released in 2001 by Steven Bird and Edward Loper, it provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet.
While modern production environments lean towards spaCy or Transformers for raw speed and deep learning integration, NLTK remains a popular choice for teaching and learning the foundational algorithms of computational linguistics.
Level 1 — Corpus Access and Basic Processing
NLTK's biggest strength is its built-in access to massive amounts of text data (corpora) that you can use to train and test your algorithms instantly.
import nltk
# 1. Download necessary datasets (Run once)
# nltk.download('gutenberg')
# nltk.download('punkt')  # recent NLTK releases may ask for 'punkt_tab' instead
# nltk.download('stopwords')
# 2. Accessing built-in text (Jane Austen's Emma)
from nltk.corpus import gutenberg
emma = gutenberg.raw('austen-emma.txt')
print(f"Total characters in Emma: {len(emma)}")
# 3. Sentence and Word Tokenization
from nltk.tokenize import sent_tokenize, word_tokenize
sample_text = "Hello there! How are you doing today? I hope you are learning NLP."
# Split into sentences
sentences = sent_tokenize(sample_text)
print("Sentences:", sentences)
# Split into words
words = word_tokenize(sample_text)
print("Words:", words)
# 4. Removing Stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words and word.isalnum()]
print("Filtered Words:", filtered_words)
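To make the tokenize-then-filter steps concrete, here is a minimal sketch of the same idea using only the standard library. The regex tokenizer and the tiny `TOY_STOPWORDS` set are stand-ins invented for illustration; they are not NLTK's tokenizer or its English stopword list, which are considerably more thorough.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set (NLTK's English list is much longer).
TOY_STOPWORDS = {"i", "you", "are", "how", "there", "the", "a", "is"}

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def content_words(text):
    """Lowercase the tokens, then drop punctuation and stopwords."""
    return [
        tok.lower()
        for tok in simple_tokenize(text)
        if tok.isalnum() and tok.lower() not in TOY_STOPWORDS
    ]

sample = "Hello there! How are you doing today? I hope you are learning NLP."
tokens = simple_tokenize(sample)
kept = content_words(sample)
print("Tokens:", tokens)
print("Content words:", kept)
print("Most common:", Counter(kept).most_common(3))
```

A naive regex like this mishandles abbreviations and contractions, which is the main reason to prefer a trained tokenizer on real text.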
Level 2 — Part-of-Speech Tagging and Chunking
Beyond simple tokenization, NLTK allows you to analyze the grammatical structure of sentences. This involves tagging words with their Part-of-Speech (POS) and grouping them into meaningful "Chunks" (like Noun Phrases).
Common POS Tags (Penn Treebank)
- NN: Noun, singular or mass
- VB: Verb, base form
- JJ: Adjective
- RB: Adverb
- IN: Preposition or subordinating conjunction
import nltk
# nltk.download('averaged_perceptron_tagger')  # recent NLTK releases may ask for 'averaged_perceptron_tagger_eng'
sentence = "The little yellow dog barked at the angry cat."
tokens = nltk.word_tokenize(sentence)
# 1. Part of Speech Tagging
tagged_words = nltk.pos_tag(tokens)
print("Tagged Words:", tagged_words)
# [('The', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('angry', 'JJ'), ('cat', 'NN'), ('.', '.')]
# 2. Chunking (Extracting Noun Phrases)
# Define a grammar rule using Regular Expressions:
# "An optional Determiner, followed by zero or more Adjectives, followed by a Noun"
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tagged_words)
# Print chunks that match our NP rule
for subtree in tree.subtrees():
    if subtree.label() == 'NP':
        print("Noun Phrase Found:", subtree.leaves())
# Noun Phrase Found: [('The', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')]
# Noun Phrase Found: [('the', 'DT'), ('angry', 'JJ'), ('cat', 'NN')]
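The chunk grammar above is a regular expression over tags rather than over characters. The following sketch shows that idea in plain Python: it encodes the tag sequence as a string and runs an ordinary regex over it. This is an illustration of the concept only, not how `RegexpParser` is implemented, and the helper name `chunk_noun_phrases` is made up for this example.

```python
import re

def chunk_noun_phrases(tagged):
    """Find spans matching: optional DT, any number of JJ, then one NN."""
    # Encode the tag sequence as a string such as "<DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)

    # Record the character offset at which each token's tag starts.
    offsets = []
    pos = 0
    for _, tag in tagged:
        offsets.append(pos)
        pos += len(tag) + 2  # +2 for the angle brackets

    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    chunks = []
    for match in pattern.finditer(tag_string):
        start = offsets.index(match.start())
        end = start + match.group().count("<")  # one "<" per matched token
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

# Tagged tokens copied from the example output above.
tagged_words = [
    ("The", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("angry", "JJ"),
    ("cat", "NN"), (".", "."),
]
noun_phrases = chunk_noun_phrases(tagged_words)
print(noun_phrases)
```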
Level 3 — WordNet and Semantic Similarity
NLTK includes an interface to WordNet, a massive lexical database of English where nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets).
from nltk.corpus import wordnet as wn
# nltk.download('wordnet')
# 1. Get Synsets (Meanings) for a word
synsets = wn.synsets('bank')
for syn in synsets:
    print(f"Meaning: {syn.name()} - {syn.definition()}")
# Meaning: bank.n.01 - sloping land (especially the slope beside a body of water)
# Meaning: depository_financial_institution.n.01 - a financial institution that accepts deposits and channels the money into lending activities
# 2. Get Synonyms (Lemmas)
good_synonyms = []
for syn in wn.synsets("good"):
    for lemma in syn.lemmas():
        good_synonyms.append(lemma.name())
print("Synonyms for 'good':", set(good_synonyms))
# 3. Calculate Semantic Similarity (Path Distance between concepts)
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
car = wn.synset('car.n.01')
print(f"Similarity (Dog / Cat): {dog.path_similarity(cat):.2f}") # ~0.20 (Closely related animals)
print(f"Similarity (Dog / Car): {dog.path_similarity(car):.2f}") # ~0.07 (Barely related)
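Path similarity is computed as 1 / (1 + d), where d is the length of the shortest path between two synsets in the hypernym ("is-a") hierarchy. The sketch below applies that formula to a small hand-written taxonomy. `TOY_HYPERNYMS` is a simplified assumption for illustration and not real WordNet data, so its scores will not match WordNet's beyond the dog/cat case.

```python
from collections import deque

# Hypothetical, heavily simplified "is-a" links (child -> parent).
TOY_HYPERNYMS = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "animal": "entity",
    "car": "vehicle",
    "vehicle": "artifact",
    "artifact": "entity",
}

def shortest_path_length(a, b):
    """Breadth-first search over the taxonomy, treating links as undirected."""
    neighbors = {}
    for child, parent in TOY_HYPERNYMS.items():
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path

def path_similarity(a, b):
    """Return 1 / (1 + shortest path length), or None if unconnected."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (1 + dist)

print(f"dog / cat: {path_similarity('dog', 'cat'):.2f}")
print(f"dog / car: {path_similarity('dog', 'car'):.2f}")
```

Because the score depends only on path length, two concepts that share a nearby ancestor always score higher than two that meet only near the root.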
spaCy
spaCy: Industrial-Strength NLP
Unlike NLTK, which was built for teaching, spaCy was built specifically for production software. Written in Cython for speed and efficient memory management, it provides a single, highly optimized implementation for each task rather than offering a menu of choices.
spaCy excels at large-scale information extraction tasks, providing pre-trained statistical neural network models for over 23 languages.
Level 1 — The Object-Oriented Pipeline
In spaCy, you load a language model, which creates a processing pipeline (conventionally named nlp). When you pass text through this pipeline, spaCy tokenizes, POS-tags, lemmatizes, and dependency-parses it in a single call, returning a rich Doc object.
Downloading Models
Before running spaCy scripts, you must download a pre-trained model for your language via the terminal:
python -m spacy download en_core_web_sm    # Small English model (~12MB)
python -m spacy download en_core_web_trf   # Transformer-based English model (~400MB)
import spacy
import pandas as pd
# Load the small English model pipeline
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion."
# Process the text - this instantly runs Tokenizer, Tagger, Parser, NER, and Lemmatizer!
doc = nlp(text)
# We can easily extract the rich linguistic data into a DataFrame for viewing
token_data = []
for token in doc:
    token_data.append([
        token.text,      # The exact text
        token.lemma_,    # The root form (buying -> buy)
        token.pos_,      # Part of speech (VERB, PROPN)
        token.dep_,      # Syntactic dependency (Subject, Object)
        token.is_stop,   # Is it a stopword? (True/False)
        token.is_alpha   # Is it alphabetical? (True/False)
    ])
df = pd.DataFrame(token_data, columns=['Text', 'Lemma', 'POS', 'Dependency', 'Is Stop', 'Is Alpha'])
print(df.head(6))
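The pipeline design itself is simple to picture: a sequence of components, each of which receives a document object, adds annotations, and passes it on. The sketch below mimics that shape in plain Python. `ToyDoc`, `ToyPipeline`, and the three component functions are hypothetical names for illustration and are far cruder than spaCy's real components.

```python
from dataclasses import dataclass, field

@dataclass
class ToyDoc:
    """Container that accumulates annotations as it moves through the pipeline."""
    text: str
    tokens: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)

def tokenizer(doc):
    # Whitespace splitting only; a real tokenizer also separates punctuation.
    doc.tokens = doc.text.split()
    return doc

def lowercaser(doc):
    doc.annotations["lower"] = [tok.lower() for tok in doc.tokens]
    return doc

def stop_flagger(doc):
    toy_stops = {"is", "at", "for"}  # hypothetical stopword set
    doc.annotations["is_stop"] = [tok.lower() in toy_stops for tok in doc.tokens]
    return doc

class ToyPipeline:
    """Calls each component in order, passing the same doc along."""
    def __init__(self, components):
        self.components = list(components)

    def __call__(self, text):
        doc = ToyDoc(text)
        for component in self.components:
            doc = component(doc)
        return doc

toy_nlp = ToyPipeline([tokenizer, lowercaser, stop_flagger])
toy_doc = toy_nlp("Apple is looking at buying U.K. startup for $1 billion.")
for tok, is_stop in zip(toy_doc.tokens, toy_doc.annotations["is_stop"]):
    print(f"{tok:<10} stop={is_stop}")
```

Because every component shares one interface (doc in, doc out), components can be added, removed, or reordered without touching the others.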
Level 2 — Named Entity Recognition (NER)
One of spaCy's strongest out-of-the-box features is its Named Entity Recognizer, which can identify people, places, organizations, money, dates, and more.
text2 = "On July 20, 1969, Neil Armstrong walked on the Moon. NASA spent roughly $25 billion on the Apollo program."
doc2 = nlp(text2)
print(f"{'Entity Text':<20} | {'Label':<10} | {'Explanation'}")
print("-" * 60)
for ent in doc2.ents:
    # ent.label_ gives the code (e.g. 'GPE'), spacy.explain() translates it to human text
    print(f"{ent.text:<20} | {ent.label_:<10} | {spacy.explain(ent.label_)}")
# Output:
# July 20, 1969 | DATE | Absolute or relative dates or periods
# Neil Armstrong | PERSON | People, including fictional
# the Moon | LOC | Non-GPE locations, mountain ranges, bodies of water
# NASA | ORG | Companies, agencies, institutions, etc.
# roughly $25 billion | MONEY | Monetary values, including unit
Level 3 — Custom Rule-Based Matching
While Regex operates on raw strings, spaCy's Matcher operates on Doc objects and token attributes. This allows you to write powerful rules based on grammar rather than just character sequences.
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
# Let's find patterns where a verb is immediately followed by a pronoun
# e.g., "loved it", "hated him", "saw them"
pattern = [
    {"POS": "VERB"},  # Match any verb
    {"POS": "PRON"}   # Match any pronoun immediately after
]
matcher.add("VERB_PRON_PATTERN", [pattern])
doc3 = nlp("I loved it! But my brother absolutely hated him for what he did.")
matches = matcher(doc3)
for match_id, start, end in matches:
    matched_span = doc3[start:end]
    print(f"Found match: '{matched_span.text}'")
# Found match: 'loved it'
# Found match: 'hated him'
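To see why matching on token attributes differs from matching on characters, consider this minimal sketch of a token-pattern matcher. It slides a window over a list of token dictionaries and checks each token against the corresponding attribute spec. The function name `match_token_pattern` and the hand-assigned POS tags are assumptions for illustration; spaCy's Matcher supports far more (operators, regex, set membership) and is implemented differently.

```python
def match_token_pattern(tokens, pattern):
    """Return (start, end) spans where consecutive tokens satisfy the pattern."""
    spans = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        # Every token in the window must match every attribute in its spec.
        if all(
            all(tok.get(attr) == value for attr, value in spec.items())
            for tok, spec in zip(window, pattern)
        ):
            spans.append((start, start + width))
    return spans

# Hand-tagged tokens (the tags are assumed, not produced by a model).
toy_tokens = [
    {"TEXT": "I", "POS": "PRON"},
    {"TEXT": "loved", "POS": "VERB"},
    {"TEXT": "it", "POS": "PRON"},
    {"TEXT": "!", "POS": "PUNCT"},
    {"TEXT": "He", "POS": "PRON"},
    {"TEXT": "hated", "POS": "VERB"},
    {"TEXT": "him", "POS": "PRON"},
    {"TEXT": ".", "POS": "PUNCT"},
]
verb_pron = [{"POS": "VERB"}, {"POS": "PRON"}]
spans = match_token_pattern(toy_tokens, verb_pron)
matched_texts = [" ".join(t["TEXT"] for t in toy_tokens[s:e]) for s, e in spans]
print(matched_texts)
```

The pattern never mentions the words "loved" or "hated"; it matches on grammatical role alone, which a character-level regex cannot express.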
spaCy also includes spacy.displacy, a visualizer that can render HTML views of syntactic dependency trees and highlighted named entities directly in your browser or Jupyter Notebook.
Gensim
Gensim: Topic Modeling for Humans
Gensim (meaning "Generate Similar") is designed to process extremely large text collections using data-streaming algorithms, without needing to load the entire dataset into RAM. It is best known for Topic Modeling (Latent Dirichlet Allocation) and Static Word Embeddings (Word2Vec).
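The streaming approach rests on a simple contract: rather than a list in memory, you hand the model a restartable iterable that yields one tokenized document at a time. Below is a minimal sketch of such an iterable. The class name `StreamedSentences`, the one-sentence-per-line file layout, and the whitespace tokenization are all assumptions made for this example.

```python
import os
import tempfile

class StreamedSentences:
    """Restartable iterable yielding one tokenized sentence per line of a file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so multi-epoch training can
        # iterate repeatedly while holding only one line in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens

# Demo with a small temporary file standing in for a large corpus.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("The patient presented with cardiac hypertrophy\n\n")
    tmp.write("Prescribed beta blockers\n")
    demo_path = tmp.name

stream = StreamedSentences(demo_path)
first_pass = list(stream)
second_pass = list(stream)  # a plain generator would be exhausted by now
os.remove(demo_path)
print(first_pass)
```

An object like this can be passed wherever a list of token lists is expected. A bare generator would not work for multi-pass training because it can only be consumed once.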
Level 1 — Training Your Own Word2Vec
Pre-trained word embeddings from Google are great, but what if your company works in a heavily specialized medical or legal field? General embeddings won't understand your domain-specific jargon! Gensim allows you to easily train an embedding model strictly on your own documents.
from gensim.models import Word2Vec
# Step 1: Prepare your data as a list of lists of words (tokenized sentences)
# In reality, this would be a massive stream of millions of sentences from a medical journal.
corpus_sentences = [
    ['the', 'patient', 'presented', 'with', 'severe', 'cardiac', 'hypertrophy'],
    ['echocardiogram', 'showed', 'thickening', 'of', 'ventricular', 'walls'],
    ['prescribed', 'beta', 'blockers', 'to', 'reduce', 'cardiac', 'strain'],
    ['patient', 'responded', 'well', 'to', 'the', 'prescribed', 'medication']
]
# Step 2: Train the Model
# vector_size = number of dimensions in each word vector (usually 100-300)
# window = how many adjacent words to look at for context
# min_count = ignores words that appear fewer than this many times
model = Word2Vec(sentences=corpus_sentences, vector_size=50, window=3, min_count=1, workers=4)
# Step 3: Extract the Mathematical Vectors
vector_for_cardiac = model.wv['cardiac']
print(f"Vector (first 5 dims): {vector_for_cardiac[:5]}")
# Step 4: Find Semantic Similarities within your Custom Domain!
similar_words = model.wv.most_similar('cardiac', topn=2)
print("Most similar to 'cardiac':", similar_words)
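# e.g., Output: [('hypertrophy', 0.1554), ('strain', 0.1203)]
# Note: with only four sentences these scores are essentially noise; meaningful
# neighbors require training on a much larger domain corpus.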
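The scores returned by `most_similar` are cosine similarities between word vectors. The following sketch shows that ranking computation by hand. The three-dimensional `toy_vectors` are invented values chosen for illustration, not output from a trained model, and the standalone `most_similar` function here is a simplified stand-in.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings (real models use 100-300 dimensions).
toy_vectors = {
    "cardiac": [0.9, 0.1, 0.3],
    "ventricular": [0.8, 0.2, 0.4],
    "medication": [0.1, 0.9, 0.2],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar("cardiac", toy_vectors)
for other, score in ranked:
    print(f"{other}: {score:.3f}")
```

Cosine similarity ignores vector length and compares direction only, so frequent and rare words are compared on equal footing.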
Level 2 — Topic Modeling with LDA
Latent Dirichlet Allocation (LDA) automatically groups thousands of articles into different "topics" by statistically analyzing which words frequently co-occur in the same documents.
from gensim import corpora, models
# Suppose we have three documents: one about tech, one about finance, one about sports.
doc_texts = [
    ["apple", "iphone", "release", "software", "update", "battery"],
    ["federal", "reserve", "interest", "rate", "inflation", "market", "economy"],
    ["team", "coach", "football", "goal", "championship", "stadium"]
]
# 1. Create a dictionary representation of the documents (maps words to integer IDs)
dictionary = corpora.Dictionary(doc_texts)
# 2. Convert each document to a Bag-of-Words vector (a list of (token_id, token_count) tuples)
corpus = [dictionary.doc2bow(text) for text in doc_texts]
# 3. Train the LDA Model
# We ask it to find 3 distinct abstract topics hidden inside the corpus
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10, random_state=42)
# 4. View the topics it discovered!
print("The 3 discovered topics are \n" + "-" * 30)
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic: {idx} \nWords: {topic}\n")
# e.g. Topic: 0 -> 0.143*"federal" + 0.143*"reserve" + 0.143*"interest" + 0.143*"rate" + etc. (exact weights will vary)
For visualization, Gensim's LDA models pair well with pyLDAvis, which generates an interactive, D3.js-powered bubble chart of the discovered topics, letting you explore the topic distributions in a web browser.
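The dictionary and bag-of-words steps that feed LDA are easy to reproduce by hand, which helps clarify what the model actually sees. The sketch below is a simplified stand-in: `build_dictionary` and `doc_to_bow` are hypothetical helpers, and the integer IDs they assign will not necessarily match those produced by Gensim's `Dictionary`.

```python
from collections import Counter

def build_dictionary(docs):
    """Map each distinct token to an integer ID in first-seen order."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_docs = [
    ["rate", "inflation", "rate"],
    ["goal", "team"],
]
toy_dictionary = build_dictionary(toy_docs)
toy_corpus = [doc_to_bow(doc, toy_dictionary) for doc in toy_docs]
print(toy_dictionary)
print(toy_corpus)
```

Note that word order is discarded entirely: LDA works only from which words appear in a document and how often.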
TextBlob
TextBlob is a Python library aimed at beginners. It wraps NLTK and the Pattern library behind a simple interface: instead of requiring you to build pipelines, parse lists, or use machine learning classes, a TextBlob object behaves much like a standard Python string with NLP methods attached.
Level 1 — The Ultimate Shortcut
from textblob import TextBlob
blob = TextBlob("I havv a terrrible headach and I feell misrable.")
# 1. Instant Spell Checking (Uses statistical pattern matching under the hood)
print("Corrected:", blob.correct())
# Corrected: I have a terrible headache and I feel miserable.
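# 2. Translation
# Older TextBlob releases exposed blob.translate(), which called the Google Translate
# web API over the internet. The method was deprecated and has been removed from
# recent releases, so use a dedicated translation library or service instead.
# french_blob = blob.translate(from_lang='en', to='fr')
# print("French:", french_blob)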
# 3. Instant Pluralization and Singularization
word1 = TextBlob("octopus")
print("Plural:", word1.words[0].pluralize()) # octopi
word2 = TextBlob("geese")
print("Singular:", word2.words[0].singularize()) # goose
# 4. N-Grams natively
print("Trigrams:", blob.ngrams(n=3))
# [WordList(['I', 'havv', 'a']), WordList(['havv', 'a', 'terrrible']), ...]
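An n-gram is just a sliding window of n consecutive tokens, so the same result can be sketched in a few lines of plain Python. The `ngrams` function below is a hypothetical helper for illustration and returns tuples rather than TextBlob's `WordList` objects.

```python
def ngrams(tokens, n=3):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "I havv a terrrible headach".split()
trigrams = ngrams(words, n=3)
bigrams = ngrams(words, n=2)
print(trigrams)
print(bigrams)
```

A sequence of k tokens yields k - n + 1 n-grams, and none at all when the sequence is shorter than n.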