Stemming

Understand how stemming algorithms such as the Porter, Snowball, and Lancaster stemmers reduce words to a base or root form.

Stemming Techniques

Stemming is a text normalization technique that reduces words to a base or root form by chopping off affixes, most commonly suffixes, according to a fixed set of rules. It is a heuristic process that operates purely on string manipulation, with no dictionary lookup, so the resulting stem is not always a valid word.

"running", "runs" → "run" (an irregular form such as "ran" is left as "ran")

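To make the idea of rule-based suffix chopping concrete, here is a minimal sketch of a toy stemmer. The rule list, the function name toy_stem, and the min_stem_len parameter are all hypothetical and chosen only for illustration; this is not the Porter, Snowball, or Lancaster algorithm.

```python
# Hypothetical toy rules (suffix, replacement), checked in order.
# This is an illustration of suffix stripping, not a real stemmer.
SUFFIX_RULES = [("ing", ""), ("ies", "i"), ("ed", ""), ("s", "")]


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            # Guard against chopping very short words down to nothing.
            if len(stem) >= min_stem_len:
                return stem + replacement
    return word


for w in ["jumping", "jumped", "jumps", "ran"]:
    print(w, "->", toy_stem(w))
```

With these toy rules, "jumping", "jumped", and "jumps" all collapse to "jump", while "ran" is untouched because no rule matches. The sketch would turn "running" into "runn"; real stemmers add further rules, for example for doubled consonants, to handle such cases.
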
Porter Stemmer

One of the oldest (1980) and most widely used suffix-stripping algorithms. It reduces a word in five sequential phases, each applying its own set of suffix rules.

"ponies" → "poni"
"caresses" → "caress"

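The two examples above come from the plural-handling rules in the first phase of the Porter algorithm (step 1a). The sketch below covers only that one step, under the assumption that the input is already lowercase; the function name porter_step_1a is illustrative, and the full algorithm applies several more steps afterwards.

```python
def porter_step_1a(word):
    """Sketch of Porter step 1a only: the plural suffix rules."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]  # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word       # SS   -> SS   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]  # S    -> ""   (cats -> cat)
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The rules are tried longest suffix first, so "caresses" matches the SSES rule before the plain S rule can fire.
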
Snowball Stemmer

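The English Snowball stemmer is also known as the Porter2 stemmer. It is a revision of the original Porter algorithm with refined, more consistent rules, and it is often slightly faster. The Snowball framework also provides stemmers for many languages besides English.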
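
One place where the Porter2 rules are more careful is the plural-handling step (step 1a). The sketch below is a simplified rendering of that single step, not the full stemmer: it assumes lowercase input, treats "y" as a vowel in every position, and ignores the exception lists of the real algorithm. The function name porter2_step_1a is illustrative.

```python
VOWELS = set("aeiouy")  # simplification: "y" is always treated as a vowel


def porter2_step_1a(word):
    """Simplified sketch of Porter2 (English Snowball) step 1a."""
    if word.endswith("sses"):
        return word[:-2]                      # caresses -> caress
    if word.endswith(("ied", "ies")):
        # -> "i" if preceded by more than one letter, otherwise "ie"
        return word[:-2] if len(word) > 4 else word[:-1]
    if word.endswith(("us", "ss")):
        return word                           # left unchanged
    if word.endswith("s"):
        # Drop the "s" only if an earlier part of the word contains a
        # vowel that is not the letter immediately before the "s".
        if any(ch in VOWELS for ch in word[:-2]):
            return word[:-1]
    return word


for w in ["cries", "ties", "gaps", "gas", "alumnus"]:
    print(w, "->", porter2_step_1a(w))
```

Under these rules short words such as "gas" and "ties" are protected from being over-trimmed, and words ending in "-us" keep their final "s".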

Lancaster Stemmer

The most aggressive of the three stemmers covered here. It is very fast, but it often truncates words so heavily that the resulting stems are hard to read.

"maximum" → "maxim"

Over-stemming
When two words with different meanings are stemmed to the same root.
"universal", "university", "universe" → "univers"

Under-stemming
When two words with the same meaning are stemmed to different roots.
"alumnus", "alumni" → "alumnus", "alumni"

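One way to spot both problems in practice is to group a vocabulary by stem and inspect the groups. The following is a minimal sketch: group_by_stem and EXAMPLE_STEMS are hypothetical names, and the stems are copied from the two examples above rather than computed, so a real check would pass an actual stemmer's stem function instead.

```python
from collections import defaultdict


def group_by_stem(words, stem_fn):
    """Map each stem to the list of words that produce it."""
    groups = defaultdict(list)
    for w in words:
        groups[stem_fn(w)].append(w)
    return dict(groups)


# Stems taken from the examples above (assumed, not computed here).
EXAMPLE_STEMS = {
    "universal": "univers",
    "university": "univers",
    "universe": "univers",
    "alumnus": "alumnus",
    "alumni": "alumni",
}

groups = group_by_stem(EXAMPLE_STEMS, EXAMPLE_STEMS.get)
for stem, members in groups.items():
    print(f"{stem:<10} <- {members}")
```

A group that mixes unrelated words (here "univers") points to over-stemming, while related words that land in separate groups ("alumnus" and "alumni") point to under-stemming. Deciding which words are actually related still requires human judgment or a labeled word list.
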
Comparing NLTK Stemmers in Python

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

words = ["running", "generously", "history", "historical", "better", "universities"]

print(f"{'Word':<15} | {'Porter':<15} | {'Snowball':<15} | {'Lancaster':<15}")
print("-" * 65)

for w in words:
    p_stem = porter.stem(w)
    s_stem = snowball.stem(w)
    l_stem = lancaster.stem(w)
    print(f"{w:<15} | {p_stem:<15} | {s_stem:<15} | {l_stem:<15}")

# Output observations:
# 'better' is never mapped to 'good' (stemming cannot handle irregular forms)
# 'universities' -> 'univers' (Porter, Snowball, and Lancaster)
# 'historical' -> 'histor' (Porter/Snowball), 'hist' (Lancaster, highly aggressive)
