NLP Tutorial

Model Evaluation & Metrics

NLP evaluation, BLEU, ROUGE, METEOR, CIDEr, and benchmark metrics.

Evaluation in NLP

The Challenge of Language Evaluation

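In traditional machine learning, a prediction can usually be compared directly with a single correct answer using metrics such as accuracy or mean squared error (MSE). Evaluating language is harder because the same idea can be expressed in many valid ways, so an output can differ from the reference and still be correct.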

Level 1 — Intrinsic vs Extrinsic Evaluation

Evaluation in NLP is generally divided into two categories:

Intrinsic

Measures the model in isolation, on a specific sub-task or on its own objective (e.g., POS tagging accuracy or perplexity).

Extrinsic

Measures how much the model improves a downstream, real-world application (e.g., does it improve search results?).
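To make the intrinsic side concrete, perplexity can be computed from the probabilities a language model assigns to each token of a held-out text. The following is a minimal sketch: the function name `perplexity` and the probability values are made up for illustration and do not come from a real model.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities two models assigned to the same four tokens.
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.25, 0.25, 0.25, 0.25]

print(perplexity(confident_model))
print(perplexity(uncertain_model))
```

Lower perplexity means the model was less "surprised" by the text. A model that assigns probability 0.25 to every token is, on average, as uncertain as if it were choosing among four equally likely words.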

Level 2 — Precision, Recall, and F1

For classification tasks (e.g., sentiment analysis or spam detection), we use the standard metrics:

  • Precision: "Of everything I guessed as spam, how much was actually spam?"
  • Recall: "Of all the spam that existed, how much did I catch?"
  • F1-Score: The harmonic mean of precision and recall, 2PR / (P + R), which is high only when both are high.
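The three definitions above can be sketched directly from true-positive, false-positive, and false-negative counts. This is an illustrative helper (the name `precision_recall_f1` is not from any library) that treats label 1 as the positive class, for example spam.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

y_true = [0, 1, 1, 0, 1]  # 1 = positive class (e.g., spam)
y_pred = [0, 1, 0, 0, 1]  # one positive example was missed

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here every positive guess was right (precision 1.0), but only two of the three real positives were caught (recall 2/3), so F1 comes out at 0.8.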

Level 3 — Modern benchmarks (GLUE & SuperGLUE)

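Today, researchers use benchmark suites such as GLUE (General Language Understanding Evaluation), a collection of sentence-level understanding tasks, to measure how well a single model generalizes across different kinds of language problems. SuperGLUE is a later, harder successor built in the same style after models began to saturate GLUE.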

Scikit-Learn Evaluation
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))

BLEU Score

BLEU (Bilingual Evaluation Understudy) is the most widely used automatic metric for Machine Translation quality. It compares a machine-translated sentence against one or more human-written reference translations.

Level 1 — The Intuition

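BLEU counts how many n-grams (contiguous sequences of n words) in the machine's sentence also appear in the human reference. Each n-gram is credited at most as many times as it occurs in the reference ("clipping"), so repeating a matching word does not inflate the score. The more overlap, the higher the score.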
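As a rough illustration of the counting step, here is unigram precision with BLEU-style clipping, where a word is credited at most as many times as it appears in the reference. The helper name and the sentences are invented for this sketch, and only a single reference is handled.

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Fraction of candidate words found in the reference, with clipping."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Each word is credited at most as often as it appears in the reference.
    matched = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return matched / len(candidate) if candidate else 0.0

reference = "the cat is on the mat".split()
good = "the cat sat on the mat".split()
degenerate = "the the the the the the".split()

print(clipped_unigram_precision(good, reference))        # 5 of 6 words match
print(clipped_unigram_precision(degenerate, reference))  # only 2 of 6 "the" are credited
```

Without clipping, the degenerate candidate would score a perfect 6/6, because every one of its words occurs somewhere in the reference.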

Problem: Brevity

Because BLEU is precision-based, a machine that outputs a single correct word ("The") would get 100% unigram precision. BLEU therefore multiplies the score by a Brevity Penalty that shrinks it when the output is shorter than the reference.
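The brevity penalty in the original BLEU definition is 1 when the candidate is at least as long as the reference and exp(1 - r/c) otherwise, where c is the candidate length and r the reference length. A minimal sketch (the function name is illustrative):

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BLEU brevity penalty for a candidate of length c and reference of length r."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0  # no penalty for outputs that are long enough
    return math.exp(1 - reference_len / candidate_len)

print(brevity_penalty(9, 9))  # same length: no penalty
print(brevity_penalty(1, 9))  # one-word output: heavily penalized
```

The penalty only punishes short outputs. Overly long outputs are already punished by lower n-gram precision.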

Level 2 — N-gram Precision

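BLEU usually looks at 1-grams, 2-grams, 3-grams, and 4-grams. It takes the geometric mean of these four precisions and multiplies it by the brevity penalty. The final score lies between 0 and 1 and is often reported on a 0 to 100 scale. Because the mean is geometric, a single n-gram order with zero matches drives the unsmoothed score to 0.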
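Putting the pieces together, the sketch below computes clipped n-gram precisions for n = 1 to 4 and combines them with a geometric mean and a brevity penalty. It is a simplified, unsmoothed, single-reference version written for illustration (the function names are made up), not a replacement for NLTK or SacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def simple_bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of 1..max_n-gram precisions."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # a geometric mean is zero if any factor is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)

reference = "the quick brown fox jumped over the lazy dog".split()
candidate = "the fast brown fox jumped over the sleepy dog".split()

for n in range(1, 5):
    print(n, modified_precision(candidate, reference, n))
print("BLEU:", simple_bleu(candidate, reference))
```

For this pair the four precisions are 7/9, 4/8, 3/7, and 2/6, and both sentences have nine tokens, so there is no brevity penalty and the score is about 0.49. Notice how quickly precision falls as n grows: two substituted words break most of the longer n-grams.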

Level 3 — Implementation Details

Researchers commonly report BLEU with the SacreBLEU library because it standardizes tokenization and records the exact settings used, so that scores from different papers are actually comparable.

BLEU Score with NLTK
from nltk.translate.bleu_score import sentence_bleu

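# sentence_bleu expects a list of tokenized references and one tokenized candidate
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'sleepy', 'dog']

# The default weights give equal weight to 1-gram through 4-gram precision
score = sentence_bleu(reference, candidate)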
print(f"BLEU score: {score}")

ROUGE Score

While BLEU is precision-oriented, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is recall-oriented: it asks how much of the reference appears in the output. It is primarily used to evaluate Text Summarization.

Level 1 — The Variants

There are three main types of ROUGE:

  • ROUGE-1: Overlap of individual words.
  • ROUGE-2: Overlap of word pairs (bigrams).
  • ROUGE-L: Based on the Longest Common Subsequence, so it rewards words that appear in the same order even when they are not adjacent.
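ROUGE-L is the least obvious of the three variants, so here is a minimal sketch of it using a standard dynamic-programming longest common subsequence. The function names are illustrative, whitespace tokenization is assumed, and a plain F1 is used to combine precision and recall (libraries may weight them differently).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(hypothesis, reference):
    """Return (precision, recall, f1) based on the LCS of the two token lists."""
    lcs = lcs_length(hypothesis, reference)
    precision = lcs / len(hypothesis) if hypothesis else 0.0
    recall = lcs / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1

hypothesis = "i like natural language processing".split()
reference = "i love natural language processing".split()
print(rouge_l(hypothesis, reference))
```

The two sentences share the in-order subsequence "i ... natural language processing" (four of five tokens), so precision, recall, and F1 are all 0.8. Reversing the word order of the hypothesis would leave ROUGE-1 unchanged but sharply reduce ROUGE-L.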

Level 2 — Precision vs Recall

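A summary needs to capture the "gist" of the original. High ROUGE recall means the summary covers most of the words or phrases in the reference, which suggests it caught the important points. High ROUGE precision means most of what the summary says also appears in the reference, so it contains little unnecessary "fluff". The two are usually combined into an F1 score.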
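The trade-off is easy to see with ROUGE-1 on a toy example. This sketch (helper name and sentences invented for illustration) compares a terse summary and a padded one against the same reference.

```python
from collections import Counter

def rouge_1(summary, reference):
    """Return (precision, recall) of clipped unigram overlap."""
    # Counter & Counter keeps the minimum count of each shared word.
    overlap = sum((Counter(summary) & Counter(reference)).values())
    precision = overlap / len(summary) if summary else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall

reference = "the cat sat on the mat".split()
terse = "cat sat on mat".split()
padded = "the cat sat on the mat in the sunny warm room today".split()

print("terse :", rouge_1(terse, reference))   # high precision, lower recall
print("padded:", rouge_1(padded, reference))  # full recall, low precision
```

The terse summary says nothing wrong but misses part of the reference. The padded summary covers the whole reference but half of it is extra material.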

Level 3 — Use Cases

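ROUGE is a standard metric for evaluating LLMs on summarization tasks. However, it only measures surface overlap: recall can be inflated by long or repetitive output, and a correct paraphrase gets little credit. For that reason it is often reported alongside BERTScore, which compares contextual embeddings for better semantic matching.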

ROUGE Score in Python
# pip install rouge
from rouge import Rouge

rouge = Rouge()
hypothesis = "I like natural language processing."
reference = "I love natural language processing."

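# get_scores returns one dict per hypothesis/reference pair; each metric
# holds recall ('r'), precision ('p'), and F1 ('f')
scores = rouge.get_scores(hypothesis, reference)
print(scores[0]['rouge-l'])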

METEOR Score

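METEOR (Metric for Evaluation of Translation with Explicit ORdering) was created to address the "rigidity" of BLEU. BLEU gives no credit for a word like "happy" when the reference says "glad", whereas METEOR can match the two as synonyms.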

Level 1 — Flexible Matching

METEOR matches words in three stages:

  1. Exact: Identical strings.
  2. Stemming: Same root (walk vs walking).
  3. Synonymy: Same meaning (big vs large).
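The three stages above can be sketched for a single pair of words as follows. The suffix stripper and the synonym table are toy stand-ins invented for this example. Real implementations use a proper stemmer and WordNet, and they apply each stage across the whole sentence to the words left unmatched by the previous stage.

```python
# Toy resources, invented for this example only.
TOY_SYNONYMS = {frozenset({"big", "large"}), frozenset({"fast", "quick"})}

def toy_stem(word):
    """Very crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def match_stage(hyp_word, ref_word):
    """Return the first stage at which two words match, or None."""
    if hyp_word == ref_word:
        return "exact"
    if toy_stem(hyp_word) == toy_stem(ref_word):
        return "stem"
    if frozenset({hyp_word, ref_word}) in TOY_SYNONYMS:
        return "synonym"
    return None

print(match_stage("fox", "fox"))       # exact
print(match_stage("walking", "walk"))  # stem
print(match_stage("big", "large"))     # synonym
print(match_stage("happy", "table"))   # None
```

Trying the cheapest, strictest test first means a looser match is only used when a stricter one is not available.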

Level 2 — Correlation with Humans

Because METEOR credits stems and synonyms and also accounts for recall, it has been reported to correlate better with human judgment than BLEU does, particularly at the sentence level. If a human thinks a translation is good, METEOR is more likely to agree.

Level 3 — Advanced Alignment

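METEOR aligns each word in the machine output with at most one word in the human reference, preferring alignments that keep the matched words in the same order. The score combines unigram precision and recall (weighted toward recall) and then applies a penalty when the matched words are scattered rather than grouped into contiguous, same-order runs.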
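Once an alignment has been found, the score itself is a short formula: a recall-weighted harmonic mean of unigram precision and recall, reduced by a fragmentation penalty that grows with the number of "chunks" (contiguous, same-order runs of matched words). The sketch below takes the alignment statistics as inputs rather than computing the alignment. The function name is illustrative, and the default parameter values are the ones NLTK's `meteor_score` uses.

```python
def meteor_from_counts(matches, chunks, hyp_len, ref_len,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """Combine alignment statistics into a METEOR-style score.

    matches: number of aligned unigrams
    chunks:  number of contiguous, same-order runs the matches form
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Harmonic mean weighted toward recall.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fewer, longer chunks mean better word order and a smaller penalty.
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)

# Three of four words matched, either as one run or as three scattered words.
print(meteor_from_counts(matches=3, chunks=1, hyp_len=4, ref_len=4))
print(meteor_from_counts(matches=3, chunks=3, hyp_len=4, ref_len=4))
```

With the same number of matched words, the output whose matches form a single run scores roughly twice as high as the one whose matches are scattered. This is how METEOR accounts for word order without counting higher-order n-grams.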

METEOR in NLTK
from nltk.translate.meteor_score import meteor_score
import nltk

# NLTK requires wordnet for synonyms
nltk.download('wordnet')

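# Recent NLTK versions expect pre-tokenized input:
# a list of reference token lists and one hypothesis token list
reference = "the quick brown fox".split()
candidate = "a fast brown fox".split()
score = meteor_score([reference], candidate)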
print(f"METEOR score: {score}")

CIDEr Score

CIDEr (Consensus-based Image Description Evaluation) is a metric designed specifically for Image Captioning tasks.

Level 1 — The Consensus Method

Instead of matching just one reference, CIDEr compares the machine caption with a set of human captions for the same image (often five or more) and averages the similarity over them. It measures how much the machine aligns with what most humans see in the image.
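The consensus idea can be sketched as an average similarity over all references. The version below is deliberately simplified: it uses raw unigram counts and cosine similarity, with no TF-IDF weighting and no higher-order n-grams, so it is not the real CIDEr score. The function names and captions are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def consensus_score(candidate, references):
    """Average similarity of one caption to every reference caption."""
    cand_vec = Counter(candidate.split())
    sims = [cosine(cand_vec, Counter(ref.split())) for ref in references]
    return sum(sims) / len(sims)

references = [
    "a dog runs on the grass",
    "a dog running across a lawn",
    "a spotted dog plays on grass",
    "a dalmatian runs outside",
    "a dog on the grass",
]
print(consensus_score("a dog runs on the grass", references))
print(consensus_score("a cat sleeps on a sofa", references))
```

The first caption scores higher because it agrees with most of the references, not just one. The unrelated caption still gets a non-trivial score from common words such as "a" and "on", which is the weakness that TF-IDF weighting is meant to address.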

Level 2 — TF-IDF Weighting

CIDEr weights each n-gram by TF-IDF: n-grams that are rare across the whole dataset (like "Dalmatian") count for more, and n-grams that appear in the captions of many images (like "dog") count for less. Matching the rare, descriptive n-grams therefore earns more credit.
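To see why rare words end up with more weight, the sketch below computes the IDF part for single words, treating the reference captions of each image as one "document". The captions and the helper name are hypothetical, and real CIDEr applies the same idea to n-grams up to length four.

```python
import math
from collections import Counter

def idf_weights(reference_sets):
    """IDF per word, where each 'document' is the set of references for one image."""
    num_images = len(reference_sets)
    doc_freq = Counter()
    for captions in reference_sets:
        # Count each word at most once per image.
        words_for_image = {word for caption in captions for word in caption.split()}
        doc_freq.update(words_for_image)
    return {word: math.log(num_images / df) for word, df in doc_freq.items()}

# Hypothetical reference captions for three images.
reference_sets = [
    ["a dalmatian dog runs on grass", "a spotted dog running"],
    ["a small dog sleeps on a sofa", "a dog on a couch"],
    ["a dog catches a frisbee", "a brown dog jumping"],
]
idf = idf_weights(reference_sets)
print("dog      :", idf["dog"])
print("dalmatian:", idf["dalmatian"])
```

In this toy dataset "dog" appears for every image, so its IDF is zero and matching it earns nothing, while "dalmatian" appears for only one image and gets the highest weight.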

Level 3 — Handling Noise

By using the consensus of many humans, CIDEr is robust to the "noisy" nature of human descriptions, where different people might focus on different details of a photo.

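CIDEr with pycocoevalcap (Sketch)
# Assumes the pycocoevalcap package (pip install pycocoevalcap); the original snippet named no library.
from pycocoevalcap.cider.cider import Cider

# refs and res map each image id to a list of lowercased, space-tokenized captions (exactly one per id in res).
# IDF is computed over the images passed in, so real use needs many images.
refs = {"img1": ["a dalmatian runs on grass", "a spotted dog running"], "img2": ["a man rides a bike", "a cyclist on a road"]}
res = {"img1": ["a dalmatian running on grass"], "img2": ["a man riding a bike"]}
# Higher CIDEr => better alignment with human consensus
score, per_image_scores = Cider().compute_score(refs, res)
print(f"CIDEr score: {score}")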