Stemming
Understand stemming algorithms such as the Porter and Snowball stemmers, which reduce words to a base or root form.
Stemming Techniques
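Stemming is a text normalization technique that reduces words to a base or root form by stripping affixes (in practice, mostly suffixes) according to a fixed set of rules. It is a heuristic process that operates purely on string manipulation, so the resulting stem is not guaranteed to be a dictionary word.
"running", "runs" → "run"
Irregular forms such as "ran" are left unchanged, because no suffix rule relates them to "run".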
Porter Stemmer
Published by Martin Porter in 1980, this is one of the oldest and most widely used suffix-stripping algorithms. It applies its rules in five sequential phases.
- "ponies" → "poni"
- "caresses" → "caress"
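The two examples above both come from the first plural-handling rules of the Porter algorithm (step 1a in Porter's paper). The following is a minimal sketch of that single step only, not a full Porter stemmer; the function name `porter_step_1a` is chosen here for illustration and the input is assumed to be lowercase.

```python
def porter_step_1a(word: str) -> str:
    """Apply only the plural-handling rules of Porter's step 1a."""
    # Rules are tried in order, longest suffix first; the first match wins.
    rules = [
        ("sses", "ss"),  # caresses -> caress
        ("ies", "i"),    # ponies   -> poni
        ("ss", "ss"),    # caress   -> caress (unchanged)
        ("s", ""),       # cats     -> cat
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["ponies", "caresses", "caress", "cats"]:
    print(f"{w} -> {porter_step_1a(w)}")
```

The remaining phases follow the same pattern of ordered suffix rules, but most of them add conditions on the length and shape of the stem that is left behind.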
Snowball Stemmer
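Snowball is a framework for writing stemmers, also created by Martin Porter. Its English stemmer is known as Porter2 and revises several rules of the original Porter algorithm, which generally gives more consistent stems. Snowball also provides stemmers for many languages other than English.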
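As a short sketch of the multi-language support, assuming NLTK is installed: `SnowballStemmer.languages` lists the available languages, and the `"porter"` entry selects the original Porter algorithm, which makes a side-by-side comparison with Porter2 easy.

```python
from nltk.stem import SnowballStemmer

# Languages shipped with NLTK's Snowball implementation
print(SnowballStemmer.languages)

porter2 = SnowballStemmer("english")   # the Porter2 (English Snowball) stemmer
original = SnowballStemmer("porter")   # the original Porter algorithm
german = SnowballStemmer("german")     # one of the non-English stemmers

for w in ["generously", "running"]:
    print(f"{w:<12} porter2={porter2.stem(w):<10} porter={original.stem(w)}")

print(german.stem("Autobahnen"))
```

No corpus download is needed for this, because the stemmers are rule-based and the stop-word option is left at its default.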
Lancaster Stemmer
The most aggressive of the three algorithms, also known as the Paice/Husk stemmer. It is fast, but it often truncates words to stems that are hard to recognize.
- "maximum" → "maxim"
Over-stemming
When two words with different meanings are stemmed to the same root.
"universal", "university", "universe" → "univers"
Under-stemming
When two words with the same meaning are stemmed to different roots.
"alumnus", "alumni" → "alumnus", "alumni"
Comparing NLTK Stemmers in Python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

# Instantiate one stemmer of each kind
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

words = ["running", "generously", "history", "historical", "better", "universities"]

# Print a table with one row per word and one column per stemmer
print(f"{'Word':<15} | {'Porter':<15} | {'Snowball':<15} | {'Lancaster':<15}")
print("-" * 65)
for w in words:
    p_stem = porter.stem(w)
    s_stem = snowball.stem(w)
    l_stem = lancaster.stem(w)
    print(f"{w:<15} | {p_stem:<15} | {s_stem:<15} | {l_stem:<15}")

# Output observations:
# 'better' is never reduced to 'good': stemmers only strip affixes, so irregular
#   forms (here a comparative adjective) are out of their reach
# 'universities' -> 'univers' with all three stemmers
# 'historical' -> 'histor' (Porter/Snowball), 'hist' (Lancaster, far more aggressive)
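
The over-stemming and under-stemming cases described earlier can be spotted by grouping a word list by stem: several unrelated words in one group suggest over-stemming, while related words spread over different groups suggest under-stemming. The sketch below is one way to do that. `group_by_stem` and `toy_stem` are hypothetical helpers written for this illustration, and `toy_stem` is a deliberately crude stand-in; in practice a real stemmer's `stem` method (for example `porter.stem`) would be passed instead.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_stem(words: Iterable[str], stem: Callable[[str], str]) -> Dict[str, List[str]]:
    """Map each stem to the list of input words that reduce to it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for w in words:
        groups[stem(w)].append(w)
    return dict(groups)


def toy_stem(word: str) -> str:
    """Crude stand-in stemmer (an assumption for this demo): strip one known suffix."""
    for suffix in ("ity", "al", "e"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


words = ["universal", "university", "universe", "alumnus", "alumni"]
groups = group_by_stem(words, toy_stem)

for stem, members in groups.items():
    print(f"{stem:<10} <- {members}")
```

With this toy stemmer, the three "univers" words collapse into one group (over-stemming), while "alumnus" and "alumni" stay in separate groups (under-stemming).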