N-grams
Preserve local word-order context by extracting contiguous sequences of n items from a given text.
Bag of Words (BoW) discards sentence structure entirely because it treats every word as an independent token (a unigram). N-grams address this by extracting contiguous sequences of n words from a given text, which lets the model capture local context and short grammatical structures.
Types of N-grams
Consider the sentence: "The weather is very good"
- Unigrams (n=1): ["The", "weather", "is", "very", "good"] (This is standard BoW!)
- Bigrams (n=2): ["The weather", "weather is", "is very", "very good"] (Captures 2-word contexts)
- Trigrams (n=3): ["The weather is", "weather is very", "is very good"]
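Extracting n-grams needs no special library; a minimal sketch in plain Python (the `ngrams` helper name is just an illustration), using a sliding window over whitespace-split tokens:

```python
def ngrams(text, n):
    """Return the n-grams of a whitespace-tokenized text as joined strings."""
    tokens = text.split()
    # Slide a window of width n across the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The weather is very good"
print(ngrams(sentence, 1))  # ['The', 'weather', 'is', 'very', 'good']
print(ngrams(sentence, 2))  # ['The weather', 'weather is', 'is very', 'very good']
print(ngrams(sentence, 3))  # ['The weather is', 'weather is very', 'is very good']
```

Real tokenizers also handle punctuation and casing, but the windowing logic is the same.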
Why are N-grams crucial for NLP?
1. Resolving Negations
Standard BoW misses negation. If a negative review says "not good", a unigram BoW treats "not" and "good" as separate tokens, which can mislead the classifier into detecting positive sentiment. A bigram model tracks "not good" as a single feature indicating negativity.
2. Named Entities
Named entities, such as people, places, and organizations, are often multi-word. "New York" names a city, whereas "New" and "York" taken separately are an adjective and an English city. Bigrams capture "New York" as a single feature.
from sklearn.feature_extraction.text import CountVectorizer
docs = ["The food is not good but the service is very good"]
# ngram_range=(min_n, max_n).
# (2, 2) means ONLY bigrams; (1, 2) means unigrams AND bigrams;
# (2, 3), used below, means bigrams AND trigrams.
vectorizer = CountVectorizer(ngram_range=(2, 3))
X = vectorizer.fit_transform(docs)
# Print all extracted N-gram features
features = vectorizer.get_feature_names_out()
print("Extracted N-grams:")
for f in features:
    print(f"- '{f}'")
'''
Output:
Extracted N-grams:
- 'but the'
- 'but the service'
- 'food is'
- 'food is not'
...
- 'is not good' <-- Trigram successfully captured the true sentiment
...
- 'not good' <-- Bigram captured negation
...
- 'very good'
'''