NLTK (Natural Language Toolkit)
Comprehensive educational library for learning classical NLP with extensive datasets and corpora.
NLTK: The Grandfather of Python NLP
The Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data. Originally released in 2001 by Steven Bird and Edward Loper, it provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet.
While modern production systems lean toward spaCy or Hugging Face Transformers for speed and deep-learning integration, NLTK remains the standard choice for teaching and learning the foundational algorithms of computational linguistics.
Level 1 — Corpus Access and Basic Processing
NLTK's biggest strength is its built-in access to massive amounts of text data (corpora) that you can use to train and test your algorithms instantly.
import nltk
# 1. Download necessary datasets (Run once)
# nltk.download('gutenberg')
# nltk.download('punkt')
# nltk.download('stopwords')
# 2. Accessing built-in text (Jane Austen's Emma)
from nltk.corpus import gutenberg
emma = gutenberg.raw('austen-emma.txt')
print(f"Total characters in Emma: {len(emma)}")
# 3. Sentence and Word Tokenization
from nltk.tokenize import sent_tokenize, word_tokenize
sample_text = "Hello there! How are you doing today? I hope you are learning NLP."
# Split into sentences
sentences = sent_tokenize(sample_text)
print("Sentences:", sentences)
# Split into words
words = word_tokenize(sample_text)
print("Words:", words)
# 4. Removing Stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words and word.isalnum()]
print("Filtered Words:", filtered_words)
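Once the tokens are cleaned up, a natural next step is counting them. The sketch below uses nltk.FreqDist (a Counter subclass that needs no extra downloads) on a small hypothetical token list standing in for the filtered_words built above:

```python
from nltk import FreqDist

# Hypothetical token list, standing in for the filtered_words built above
tokens = ["learning", "nlp", "hope", "learning", "today", "learning"]

freq = FreqDist(tokens)
print(freq.most_common(1))  # most frequent token with its count: [('learning', 3)]
print(freq["learning"])     # count for a single token: 3
```

FreqDist also offers plotting helpers (e.g. freq.plot()) that are handy when exploring a full corpus.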
Level 2 — Part-of-Speech Tagging and Chunking
Beyond simple tokenization, NLTK allows you to analyze the grammatical structure of sentences. This involves tagging words with their Part-of-Speech (POS) and grouping them into meaningful "Chunks" (like Noun Phrases).
Common POS Tags (Penn Treebank)
- NN: Noun, singular or mass
- VB: Verb, base form
- JJ: Adjective
- RB: Adverb
- IN: Preposition or subordinating conjunction
import nltk
# nltk.download('averaged_perceptron_tagger')
sentence = "The little yellow dog barked at the angry cat."
tokens = nltk.word_tokenize(sentence)
# 1. Part of Speech Tagging
tagged_words = nltk.pos_tag(tokens)
print("Tagged Words:", tagged_words)
# [('The', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('angry', 'JJ'), ('cat', 'NN'), ('.', '.')]
# 2. Chunking (Extracting Noun Phrases)
# Define a grammar rule using Regular Expressions:
# "An optional Determiner, followed by zero or more Adjectives, followed by a Noun"
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tagged_words)
# Print chunks that match our NP rule
for subtree in tree.subtrees():
    if subtree.label() == 'NP':
        print("Noun Phrase Found:", subtree.leaves())
# Noun Phrase Found: [('The', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')]
# Noun Phrase Found: [('the', 'DT'), ('angry', 'JJ'), ('cat', 'NN')]
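Chunk trees like the one above can also be flattened into the standard IOB (Inside-Outside-Begin) scheme used by most sequence-labeling tools. This sketch builds a small tree by hand (so no tagger download is needed) and converts it with nltk.chunk.tree2conlltags:

```python
from nltk import Tree
from nltk.chunk import tree2conlltags

# Hand-built chunk tree, mirroring the parser output above
tree = Tree('S', [
    Tree('NP', [('The', 'DT'), ('dog', 'NN')]),
    ('barked', 'VBD'),
])

# Each token becomes (word, pos, iob), where the IOB tag marks chunk boundaries
print(tree2conlltags(tree))
# [('The', 'DT', 'B-NP'), ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'O')]
```

The inverse function, nltk.chunk.conlltags2tree, rebuilds a tree from IOB triples, which is useful when reading chunked data from CoNLL-format files.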
Level 3 — WordNet and Semantic Similarity
NLTK includes an interface to WordNet, a massive lexical database of English where nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets).
from nltk.corpus import wordnet as wn
# nltk.download('wordnet')
# 1. Get Synsets (Meanings) for a word
synsets = wn.synsets('bank')
for syn in synsets[:2]:  # show the first two of many senses
    print(f"Meaning: {syn.name()} - {syn.definition()}")
# Meaning: bank.n.01 - sloping land (especially the slope beside a body of water)
# Meaning: depository_financial_institution.n.01 - a financial institution that accepts deposits and channels the money into lending activities
# 2. Get Synonyms (Lemmas)
good_synonyms = []
for syn in wn.synsets("good"):
    for lemma in syn.lemmas():
        good_synonyms.append(lemma.name())
print("Synonyms for 'good':", set(good_synonyms))
# 3. Calculate Semantic Similarity (Path Distance between concepts)
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
car = wn.synset('car.n.01')
print(f"Similarity (Dog / Cat): {dog.path_similarity(cat):.2f}")  # ~0.20 (closely related animals)
print(f"Similarity (Dog / Car): {dog.path_similarity(car):.2f}")  # much lower (distantly related concepts)