N-gram Q&A

N-gram language models – short Q&A

20 questions and answers on n-gram language models, Markov assumptions, smoothing methods, and how perplexity evaluates model predictions.

1. What is an n-gram model in NLP?

Answer: An n-gram model is a probabilistic language model that approximates the probability of a word given its previous n−1 words, assuming a Markov dependency of limited order.

2. What is the Markov assumption in n-gram models?

Answer: The Markov assumption states that the probability of a word depends only on a fixed number of preceding words (its history window), not on the entire preceding sentence.

3. How do you estimate n-gram probabilities from a corpus?

Answer: Maximum likelihood estimation uses relative frequencies: P(w_i | w_{i-n+1}^{i-1}) = count(w_{i-n+1}...w_i) / count(w_{i-n+1}...w_{i-1}), possibly followed by smoothing.
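As an illustration, here is a minimal bigram MLE estimator over a toy corpus (the corpus and the function name `mle_prob` are hypothetical, chosen for this sketch):

```python
from collections import Counter

# Tiny illustrative corpus (an assumption, not from the text).
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]

bigram_counts = Counter()
history_counts = Counter()
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        bigram_counts[(w1, w2)] += 1
        history_counts[w1] += 1

def mle_prob(w1, w2):
    """P(w2 | w1) = count(w1 w2) / count(w1), with no smoothing."""
    if history_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / history_counts[w1]

print(mle_prob("the", "cat"))  # "the cat" occurs 2 of the 3 times "the" is a history -> 2/3
```

Note that any bigram absent from the corpus gets probability exactly zero here, which is the problem smoothing addresses.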

4. Why is smoothing necessary in n-gram language models?

Answer: Without smoothing, unseen n-grams get zero probability, which leads to zero probability for any sentence containing them; smoothing redistributes some probability mass to unseen events.

5. What is Laplace (add-one) smoothing, and why is it rarely used in practice?

Answer: Laplace smoothing adds one to all counts before normalizing; it tends to overestimate unseen events and underestimate frequent ones, so more sophisticated methods are preferred.
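A minimal sketch of add-one smoothing for bigrams (toy corpus assumed; note the denominator grows by the vocabulary size V so the distribution still sums to one):

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = {w for s in corpus for w in s}
V = len(vocab)  # vocabulary size

bigrams = Counter((w1, w2) for s in corpus for w1, w2 in zip(s, s[1:]))
histories = Counter(w1 for s in corpus for w1 in s[:-1])

def laplace_prob(w1, w2):
    # Add one to every bigram count; add V to the denominator to renormalize.
    return (bigrams[(w1, w2)] + 1) / (histories[w1] + V)

# An unseen bigram such as "cat dog" now gets a small nonzero probability:
print(laplace_prob("cat", "dog"))
```

With V = 4 here, an unseen bigram after "cat" receives 1/5 of the mass, which illustrates the criticism in the answer: add-one shifts a large share of probability onto events that were never observed.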

6. What are some common advanced smoothing techniques?

Answer: Common techniques include Good–Turing, Katz backoff, Kneser–Ney and interpolated Kneser–Ney, which better handle unseen n-grams and back off to lower-order models.

7. What is perplexity, and how is it related to n-gram models?

Answer: Perplexity is the exponential of the average negative log-likelihood per word; it measures how well a language model predicts a test set, with lower perplexity indicating better performance.
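The definition translates directly into a few lines of code (the per-word probabilities below are made up for illustration; any model that assigns a probability to each test word would supply them):

```python
import math

# Hypothetical per-word probabilities a model assigns to a 4-word test string.
word_probs = [0.2, 0.1, 0.25, 0.05]

# Perplexity = exp of the average negative log-likelihood per word,
# equivalently the geometric mean of 1/p over the test words.
avg_nll = -sum(math.log(p) for p in word_probs) / len(word_probs)
perplexity = math.exp(avg_nll)
print(perplexity)
```

A model that assigned every word probability 1/k would have perplexity exactly k, which is why perplexity is often read as an effective branching factor.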

8. How does increasing n affect an n-gram model?

Answer: Larger n allows the model to capture longer context but dramatically increases the number of possible n-grams and data sparsity, requiring more data and stronger smoothing.

9. What is backoff in n-gram language models?

Answer: Backoff methods assign probabilities by using higher-order n-gram counts when available and “backing off” to lower-order models when counts are zero or unreliable, adjusting weights appropriately.
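Katz backoff computes the discounted weights carefully; as a simpler sketch of the same idea, here is "stupid backoff" (unnormalized scores, fixed factor alpha), with a toy corpus assumed:

```python
from collections import Counter

corpus = [["a", "b", "c"], ["a", "b", "d"], ["b", "c"]]
bigrams = Counter((w1, w2) for s in corpus for w1, w2 in zip(s, s[1:]))
unigrams = Counter(w for s in corpus for w in s)
total = sum(unigrams.values())

def backoff_score(w1, w2, alpha=0.4):
    """Use the bigram estimate when the bigram was seen; otherwise back
    off to a scaled unigram estimate ("stupid backoff", unnormalized)."""
    if bigrams[(w1, w2)] > 0:
        history = sum(c for (h, _), c in bigrams.items() if h == w1)
        return bigrams[(w1, w2)] / history
    return alpha * unigrams[w2] / total

print(backoff_score("a", "b"))  # seen bigram: uses the bigram relative frequency
print(backoff_score("c", "a"))  # unseen bigram: falls back to scaled unigram
```

Proper backoff methods such as Katz replace the fixed alpha with weights chosen so the result is a true probability distribution.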

10. What is interpolation in the context of n-gram models?

Answer: Interpolation combines probabilities from multiple n-gram orders (e.g. unigram, bigram, trigram) with learned or fixed weights, ensuring all models contribute to the final estimate.
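A minimal sketch of linear interpolation across three orders (the component probabilities and weights below are invented for illustration):

```python
# Hypothetical component estimates for one next-word prediction.
p_unigram, p_bigram, p_trigram = 0.001, 0.01, 0.2

# Fixed interpolation weights; in practice they are tuned on held-out
# data (e.g. by EM) and must sum to 1 so the mixture is a valid distribution.
l1, l2, l3 = 0.1, 0.3, 0.6

p_interp = l1 * p_unigram + l2 * p_bigram + l3 * p_trigram
print(p_interp)
```

Unlike backoff, every order contributes to every estimate, so even a well-attested trigram is blended with its bigram and unigram estimates.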

11. How do n-gram models handle the beginning and end of sentences?

Answer: Special start-of-sentence (<s>) and end-of-sentence (</s>) tokens are added so that n-grams can model sentence boundaries and the probability of starting or ending with particular words.
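A small sketch of the usual padding convention (the helper name `pad_sentence` is ours):

```python
def pad_sentence(words, n):
    """Add n-1 start tokens and one end token so every word, including
    the first and last, has a full n-gram context."""
    return ["<s>"] * (n - 1) + words + ["</s>"]

print(pad_sentence(["the", "cat", "sat"], 3))
# -> ['<s>', '<s>', 'the', 'cat', 'sat', '</s>']
```

The `</s>` token matters for making probabilities sum to one over sentences of all lengths, since the model must explicitly predict that a sentence ends.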

12. What are some limitations of n-gram language models?

Answer: They suffer from data sparsity, limited context, and large memory footprints for high-order models, and they lack the ability to capture long-range dependencies or deep semantics compared to neural LMs.

13. Where are n-gram models still useful today?

Answer: N-gram models remain useful in small-footprint or low-resource scenarios, as baselines in research, and in applications like simple spelling correction or predictive text with limited memory.

14. How can n-gram models be used for next-word prediction?

Answer: Given a history of the last n−1 words, the model computes probabilities for all possible next words and chooses the one with highest probability or samples according to the distribution.
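A bigram version of this procedure might look like the following sketch (toy corpus and the helper name `predict_next` are assumptions):

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
bigrams = Counter((w1, w2) for s in corpus for w1, w2 in zip(s, s[1:]))

def predict_next(history_word):
    """Return candidate next words ranked by bigram count (argmax first)."""
    candidates = {w2: c for (w1, w2), c in bigrams.items() if w1 == history_word}
    return sorted(candidates, key=candidates.get, reverse=True)

print(predict_next("the"))  # "cat" (count 2) ranks above "dog" (count 1)
```

Sampling instead of taking the argmax just means drawing from the normalized counts rather than always picking the top entry.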

15. What role do vocabulary and OOV handling play in n-gram models?

Answer: N-gram models typically operate on a fixed vocabulary; unseen words are mapped to an <UNK> token, and vocabulary design affects sparsity, memory demands and generalization behavior.
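One common way to build such a vocabulary is a frequency cutoff; a minimal sketch (threshold and data invented for illustration):

```python
from collections import Counter

train = ["the", "cat", "sat", "the", "cat", "ran", "the", "dog"]

# Keep only words seen at least twice; everything else maps to <UNK>.
counts = Counter(train)
vocab = {w for w, c in counts.items() if c >= 2} | {"<UNK>"}

def map_oov(words):
    return [w if w in vocab else "<UNK>" for w in words]

print(map_oov(["the", "hamster", "sat"]))
```

Here "hamster" was never seen and "sat" fell below the cutoff, so both become `<UNK>`; the threshold trades vocabulary coverage against sparsity and memory.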

16. How does Kneser–Ney smoothing differ from simpler methods?

Answer: Kneser–Ney not only discounts counts but also defines lower-order probabilities based on how many distinct contexts a word appears in, giving better estimates for rare but widely distributed words.
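The key ingredient is the continuation probability; a toy sketch (corpus constructed so "francisco" and "york" have equal token counts but different context diversity):

```python
from collections import Counter

# "san francisco" three times vs. "york" after three distinct contexts.
corpus = [["san", "francisco"]] * 3 + [["new", "york"], ["old", "york"], ["in", "york"]]
bigram_types = {(w1, w2) for s in corpus for w1, w2 in zip(s, s[1:])}

def continuation_prob(w):
    """P_cont(w): fraction of distinct bigram types ending in w.
    Both words occur 3 times, but "york" follows 3 distinct contexts
    while "francisco" follows only "san", so "york" scores higher."""
    ending_in_w = sum(1 for (_, w2) in bigram_types if w2 == w)
    return ending_in_w / len(bigram_types)

print(continuation_prob("york"), continuation_prob("francisco"))
```

Full Kneser–Ney combines this continuation estimate with absolute discounting of the higher-order counts; only the lower-order component is sketched here.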

17. How do n-gram models relate to neural language models?

Answer: Both predict next-word probabilities given a history, but neural language models replace discrete n-gram counts with continuous representations and can capture longer-range patterns and interactions.

18. Why is perplexity usually computed on a held-out test set?

Answer: Perplexity on held-out data measures generalization: if evaluated only on training data, a model can appear artificially good due to overfitting and memorizing observed sequences.

19. What is the relationship between n-gram order and parameter count?

Answer: As n increases, the number of possible n-grams grows as V^n for vocabulary size V (exponential in n), so high-order models can have enormous parameter spaces even though most of those n-grams are never observed.
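The arithmetic makes the point concrete (V = 50,000 is an assumed but typical vocabulary size):

```python
# Upper bound on distinct n-grams: V**n for vocabulary size V.
V = 50_000
for n in (1, 2, 3):
    print(f"{n}-grams: up to {V ** n:,} parameters")
# Even 3-grams allow 1.25e14 combinations, far more than any real corpus
# contains, which is why most high-order counts are zero (sparsity).
```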

20. How can n-gram ideas influence modern neural language modeling?

Answer: N-gram insights about context windows, smoothing, and evaluation via perplexity still inform neural LM design, and n-gram models often serve as baselines or components in hybrid systems.

🔍 N-gram modeling concepts covered

This page covers n-gram language modeling: Markov assumptions, smoothing, backoff and interpolation, perplexity and the place of n-gram models in the broader NLP toolkit.

Markov assumption
Smoothing methods
Backoff & interpolation
Perplexity evaluation
N-gram order vs sparsity
N-grams vs neural LMs