Lemmatization
Learn Lemmatization to convert words into their proper dictionary lemmas using WordNet and morphology.
Lemmatization, unlike stemming, reduces words to their valid dictionary base form, known as the lemma. Where stemming blindly chops off letters, lemmatization applies vocabulary and morphological analysis, consulting a dictionary such as WordNet.
Stemming vs Lemmatization
| Original Word | Stemming | Lemmatization |
|---|---|---|
| Studying | studi | study |
| Better | better | good |
| Mice | mice | mouse |
| Was / Were | wa / were | be |
Lemmatization is computationally more expensive than stemming, but it yields much higher-quality, readable text.
The Importance of POS Context
To correctly lemmatize an irregular word, the algorithm must know its Part-Of-Speech (POS) context. For example, look at the word "saw":
- "He saw a bird." (Verb) → lemma is see
- "He cut it with a saw." (Noun) → lemma is saw
NLTK's lemmatizer defaults to Noun if you don't provide the POS tag!
NLTK vs spaCy Lemmatization
In industry practice, spaCy is generally preferred for lemmatization because it computes POS tags automatically before lemmatizing, so you never have to supply tags by hand.
```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time setup: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# Default POS is 'n' (Noun)
print(lemmatizer.lemmatize("better"))
# Output: better (Incorrect!)

# Changing POS to 'a' (Adjective)
print(lemmatizer.lemmatize("better", pos="a"))
# Output: good (Correct!)

# Changing POS to 'v' (Verb)
print(lemmatizer.lemmatize("running", pos="v"))
# Output: run (Correct!)
```
```python
import spacy

# Load language model (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

text = "The mice were running better than before."
doc = nlp(text)

print(f"{'Word':<10} | {'Lemma':<10}")
print("-" * 25)
for token in doc:
    print(f"{token.text:<10} | {token.lemma_:<10}")

# Output (abridged):
# mice       | mouse
# were       | be
# running    | run
# better     | well
```