Lemmatization

Learn how lemmatization converts words into their proper dictionary lemmas using WordNet and morphological analysis.

Lemmatization, unlike stemming, reduces words to their valid dictionary base form, known as the lemma. While stemming blindly chops off suffixes, lemmatization uses vocabulary and morphological analysis, referring to a dictionary such as WordNet.

Stemming vs Lemmatization

Original Word | Stemming  | Lemmatization
Studying      | studi     | study
Better        | better    | good
Mice          | mice      | mouse
Was / Were    | wa / were | be

Lemmatization is computationally more expensive but yields much higher quality, readable text.

The Importance of POS Context

To correctly lemmatize an irregular word, the algorithm must know its Part-Of-Speech (POS) context. For example, look at the word "saw":

  • "He saw a bird." (Verb) → lemma is see
  • "He cut it with a saw." (Noun) → lemma is saw

NLTK's lemmatizer defaults to Noun if you don't provide the POS tag!

NLTK vs spaCy Lemmatization

In industry practice, spaCy is preferred for lemmatization because it automatically computes POS tags in the background before lemmatizing, so you don't have to supply tags manually.

NLTK (Requires Manual POS)
import nltk
from nltk.stem import WordNetLemmatizer

# WordNet data must be available: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# Default POS is 'n' (Noun)
print(lemmatizer.lemmatize("better"))   
# Output: better (Incorrect!)

# Changing POS to 'a' (Adjective)
print(lemmatizer.lemmatize("better", pos="a")) 
# Output: good (Correct!)

# Changing POS to 'v' (Verb)
print(lemmatizer.lemmatize("running", pos="v")) 
# Output: run (Correct!)

spaCy (Automatic POS)
import spacy

# Load language model
nlp = spacy.load("en_core_web_sm")

text = "The mice were running better than before."
doc = nlp(text)

print(f"{'Word':<10} | {'Lemma':<10}")
print("-" * 25)
for token in doc:
    print(f"{token.text:<10} | {token.lemma_:<10}")

# Output (excerpt; the loop prints every token):
# mice       | mouse
# were       | be
# running    | run
# better     | well