NLTK (Natural Language Toolkit)
Comprehensive educational library for learning classical NLP with extensive datasets and corpora.
NLTK: The Grandfather of Python NLP
The Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data. Originally released in 2001 by Steven Bird and Edward Loper, it provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet.
While modern production systems lean toward spaCy or Hugging Face Transformers for speed and deep-learning integration, NLTK remains the standard choice for teaching and learning the foundational algorithms of computational linguistics.
Level 1 — Corpus Access and Basic Processing
NLTK's biggest strength is its built-in access to massive amounts of text data (corpora) that you can use to train and test your algorithms instantly.
import nltk
# 1. Download necessary datasets (Run once)
# nltk.download('gutenberg')
# nltk.download('punkt')
# nltk.download('stopwords')
# 2. Accessing built-in text (Jane Austen's Emma)
from nltk.corpus import gutenberg
emma = gutenberg.raw('austen-emma.txt')
print(f"Total characters in Emma: {len(emma)}")
# 3. Sentence and Word Tokenization
from nltk.tokenize import sent_tokenize, word_tokenize
sample_text = "Hello there! How are you doing today? I hope you are learning NLP."
# Split into sentences
sentences = sent_tokenize(sample_text)
print("Sentences:", sentences)
# Split into words
words = word_tokenize(sample_text)
print("Words:", words)
# 4. Removing Stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words and word.isalnum()]
print("Filtered Words:", filtered_words)
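Once the tokens are cleaned up, a natural next step is counting them. The sketch below uses nltk.FreqDist (a Counter subclass that needs no extra downloads) on a small hypothetical token list standing in for the filtered_words built above:

```python
from nltk import FreqDist

# Hypothetical token list, standing in for the filtered_words built above
tokens = ["learning", "nlp", "hope", "learning", "today", "learning"]

freq = FreqDist(tokens)
print(freq.most_common(1))  # most frequent token with its count: [('learning', 3)]
print(freq["learning"])     # count for a single token: 3
```

FreqDist also offers plotting helpers (e.g. freq.plot()) that are handy when exploring a full corpus.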
Level 2 — Part-of-Speech Tagging and Chunking
Beyond simple tokenization, NLTK allows you to analyze the grammatical structure of sentences. This involves tagging words with their Part-of-Speech (POS) and grouping them into meaningful "Chunks" (like Noun Phrases).
Common POS Tags (Penn Treebank)
- NN: Noun, singular or mass
- VB: Verb, base form
- JJ: Adjective
- RB: Adverb
- IN: Preposition or subordinating conjunction
import nltk
# nltk.download('averaged_perceptron_tagger')
sentence = "The little yellow dog barked at the angry cat."
tokens = nltk.word_tokenize(sentence)
# 1. Part of Speech Tagging
tagged_words = nltk.pos_tag(tokens)
print("Tagged Words:", tagged_words)
# [('The', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('angry', 'JJ'), ('cat', 'NN'), ('.', '.')]
# 2. Chunking (Extracting Noun Phrases)
# Define a grammar rule using Regular Expressions:
# "An optional Determiner, followed by zero or more Adjectives, followed by a Noun"
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tagged_words)
# Print chunks that match our NP rule
for subtree in tree.subtrees():
    if subtree.label() == 'NP':
        print("Noun Phrase Found:", subtree.leaves())
# Noun Phrase Found: [('The', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')]
# Noun Phrase Found: [('the', 'DT'), ('angry', 'JJ'), ('cat', 'NN')]
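Chunk trees like the one above can also be flattened into the standard IOB (Inside-Outside-Begin) scheme used by most sequence-labeling tools. This sketch builds a small tree by hand (so no tagger download is needed) and converts it with nltk.chunk.tree2conlltags:

```python
from nltk import Tree
from nltk.chunk import tree2conlltags

# Hand-built chunk tree, mirroring the parser output above
tree = Tree('S', [
    Tree('NP', [('The', 'DT'), ('dog', 'NN')]),
    ('barked', 'VBD'),
])

# Each token becomes (word, pos, iob), where the IOB tag marks chunk boundaries
print(tree2conlltags(tree))
# [('The', 'DT', 'B-NP'), ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'O')]
```

The inverse function, nltk.chunk.conlltags2tree, rebuilds a tree from IOB triples, which is useful when reading chunked data from CoNLL-format files.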
Level 3 — WordNet and Semantic Similarity
NLTK includes an interface to WordNet, a massive lexical database of English where nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets).
from nltk.corpus import wordnet as wn
# nltk.download('wordnet')
# 1. Get Synsets (Meanings) for a word
synsets = wn.synsets('bank')
for syn in synsets[:2]:  # show the first two of many senses
    print(f"Meaning: {syn.name()} - {syn.definition()}")
# Meaning: bank.n.01 - sloping land (especially the slope beside a body of water)
# Meaning: depository_financial_institution.n.01 - a financial institution that accepts deposits and channels the money into lending activities
# 2. Get Synonyms (Lemmas)
good_synonyms = []
for syn in wn.synsets("good"):
    for lemma in syn.lemmas():
        good_synonyms.append(lemma.name())
print("Synonyms for 'good':", set(good_synonyms))
# 3. Calculate Semantic Similarity (Path Distance between concepts)
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
car = wn.synset('car.n.01')
print(f"Similarity (Dog / Cat): {dog.path_similarity(cat):.2f}")  # ~0.20 (closely related animals)
print(f"Similarity (Dog / Car): {dog.path_similarity(car):.2f}")  # much lower (distantly related concepts)