N-grams Tutorial Section

N-grams

Preserve local word-order context by grouping contiguous sequences of n items from a given text.

Bag of Words (BoW) discards sentence structure entirely because it treats every word as an independent token (a unigram). N-grams address this by extracting contiguous sequences of n words from the text, which lets a model capture local context and short grammatical structures.

Types of N-grams

Consider the sentence: "The weather is very good"

  • Unigrams (n=1): ["The", "weather", "is", "very", "good"] (This is standard BoW!)
  • Bigrams (n=2): ["The weather", "weather is", "is very", "very good"] (Captures 2-word contexts)
  • Trigrams (n=3): ["The weather is", "weather is very", "is very good"]
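The sliding-window idea behind this list can be sketched in a few lines of plain Python (no library needed); the helper name `ngrams` is an illustrative choice, not a standard function:

```python
def ngrams(text, n):
    """Return all contiguous n-word sequences from a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The weather is very good"
print(ngrams(sentence, 1))  # ['The', 'weather', 'is', 'very', 'good']
print(ngrams(sentence, 2))  # ['The weather', 'weather is', 'is very', 'very good']
print(ngrams(sentence, 3))  # ['The weather is', 'weather is very', 'is very good']
```

Note that a sentence of length L yields L - n + 1 n-grams, which is why the lists above get shorter as n grows.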

Why are N-grams crucial for NLP?

1. Resolving Negations

Standard BoW misses negation. If a negative review says "not good", a unigram BoW treats "not" and "good" as separate tokens, which can mislead a classifier into detecting positive sentiment. A bigram model tracks "not good" as a single feature that signals negativity.
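A quick illustration using the same `CountVectorizer` API as the main example later in this section (the review text is invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

review = ["The movie was not good"]

# Unigram vocabulary: "not" and "good" are separate, unrelated features
uni = CountVectorizer(ngram_range=(1, 1)).fit(review)
print(sorted(uni.vocabulary_))  # ['good', 'movie', 'not', 'the', 'was']

# Bigram vocabulary: "not good" becomes one feature signaling negation
bi = CountVectorizer(ngram_range=(2, 2)).fit(review)
print(sorted(bi.vocabulary_))  # ['movie was', 'not good', 'the movie', 'was not']
```

In the unigram vocabulary the word "good" looks like positive evidence; in the bigram vocabulary the classifier can learn a weight for "not good" directly.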

2. Named Entities

Named entities such as places and organizations are often multi-word. "New York" means a city, whereas "New" and "York" separately mean an adjective and an English city. Bigrams capture "New York" as a single precise feature.

The Curse of Dimensionality: Why don't we use 10-grams everywhere? Because the number of combinations explodes. If a document collection has 10,000 unique words, it could in principle contain up to 10,000 squared (~100 million) distinct bigrams. Too many features causes machine learning models to overfit and can exhaust available memory. In practice, n=2 (bigrams) and n=3 (trigrams) are the largest values commonly used.
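The blow-up is visible even on a tiny corpus. A rough sketch (the two sentences are invented examples) that prints the vocabulary size as the n-gram range widens:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the food is not good but the service is very good",
    "the weather is very good today and the view is great",
]

# Vocabulary size grows quickly as the n-gram range widens
for n in (1, 2, 3):
    vec = CountVectorizer(ngram_range=(1, n)).fit(docs)
    print(f"ngram_range=(1, {n}): {len(vec.vocabulary_)} features")
```

Each step up in n adds nearly one new feature per word position in the corpus, and on real datasets with thousands of documents the feature matrix grows far faster than the useful signal.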

Implementing Bigrams & Trigrams with Scikit-Learn

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The food is not good but the service is very good"]

# ngram_range=(min_n, max_n): (2, 2) means ONLY bigrams, (1, 2) means
# unigrams AND bigrams. Here (2, 3) extracts bigrams AND trigrams.
vectorizer = CountVectorizer(ngram_range=(2, 3))

X = vectorizer.fit_transform(docs)

# Print all extracted N-gram features
features = vectorizer.get_feature_names_out()
print("Extracted N-grams:")
for f in features:
    print(f"- '{f}'")

'''
Output:
Extracted N-grams:
- 'but the'
- 'but the service'
- 'food is'
- 'food is not'
...
- 'is not good'   <-- Trigram successfully captured the true sentiment
- 'not good'      <-- Bigram captured negation
- 'very good'
'''