Model Evaluation & Metrics
NLP evaluation, BLEU, ROUGE, METEOR, CIDEr, and benchmark metrics.
Evaluation in NLP
The Challenge of Language Evaluation
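In traditional machine learning, a prediction is either right or wrong, or off by a measurable amount, so metrics like accuracy or mean squared error (MSE) work well. In NLP, and especially in text generation, evaluation is much harder because a single concept can be expressed in many valid ways.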
Level 1 — Intrinsic vs Extrinsic Evaluation
Evaluation in NLP is generally divided into two categories:
Intrinsic
Measures the model on a specific sub-task (e.g., POS tagging accuracy or Perplexity).
Extrinsic
Measures how well the model helps a real-world application (e.g., does it improve Search results?).
Level 2 — Precision, Recall, and F1
For classification (Sentiment, Spam), we use the standard metrics:
- Precision: "Of everything I guessed as spam, how much was actually spam?"
- Recall: "Of all the spam that existed, how much did I catch?"
- F1-Score: The harmonic mean that balances both.
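As a minimal sketch of the three definitions above, the metrics can be computed directly from true-positive, false-positive, and false-negative counts. The helper name `precision_recall_f1` and the toy labels are illustrative assumptions, not part of any library.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # caught spam
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarms
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # missed spam
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # The harmonic mean is dominated by the smaller of the two values.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 1 = spam, 0 = not spam (illustrative data only).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case every message flagged as spam really is spam (precision 1.0), but one of three spam messages is missed (recall about 0.667), so F1 lands between the two at 0.8.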
Level 3 — Modern benchmarks (GLUE & SuperGLUE)
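Today, researchers use benchmark suites such as GLUE (General Language Understanding Evaluation) and its harder successor, SuperGLUE. Each suite bundles several different language-understanding tasks (for example sentiment classification, textual entailment, and question answering) and reports per-task metrics plus an aggregate score, so a model is judged on how well it generalizes across tasks rather than on a single dataset.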
from sklearn.metrics import classification_report
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))
BLEU Score
BLEU (Bilingual Evaluation Understudy) is the most widely reported automatic metric for measuring the quality of Machine Translation. It compares a machine-translated sentence against one or more human-written reference translations.
Level 1 — The Intuition
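BLEU essentially counts how many n-grams (runs of n consecutive words) from the machine's sentence also appear in the human reference. The more overlap, the higher the score. Each n-gram is credited at most as many times as it occurs in a reference, so repeating a correct word does not inflate the score.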
Problem: Brevity
If a machine only outputs one correct word ("The"), it could get 100% precision. BLEU adds a Brevity Penalty to penalize output that is too short.
Level 2 — N-gram Precision
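BLEU usually computes a precision for 1-grams, 2-grams, 3-grams, and 4-grams, takes the geometric mean of these precisions, and multiplies the result by the Brevity Penalty to get the final score (between 0 and 1, often reported on a 0 to 100 scale). Because the mean is geometric, a single n-gram order with no matches drives the unsmoothed score to 0, which is why sentence-level BLEU is usually computed with smoothing.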
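The calculation described above can be sketched in plain Python. This is a simplified, unsmoothed, sentence-level version written for illustration; the function names (`clipped_precision`, `brevity_penalty`, `simple_bleu`) are assumptions, and library implementations differ in tokenization and smoothing details.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams found in a reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Credit each n-gram at most as often as it occurs in any single reference.
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """1.0 for long-enough output, exp(1 - r/c) when the candidate is too short."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Use the reference length closest to the candidate length (ties: shorter).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def simple_bleu(candidate, references, max_n=4):
    """Brevity penalty times the geometric mean of 1..max_n clipped precisions."""
    precisions = [clipped_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed: one empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


references = [["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]]
candidate = ["the", "fast", "brown", "fox", "jumped", "over", "the", "sleepy", "dog"]
score = simple_bleu(candidate, references)
one_word_precision = clipped_precision(["the"], references, 1)
one_word_penalty = brevity_penalty(["the"], references)
print(f"BLEU: {score:.4f}")
print(f"One-word output: precision={one_word_precision}, penalty={one_word_penalty:.6f}")
```

The last two values illustrate the brevity problem: a one-word output reaches a unigram precision of 1.0, but its penalty factor is close to zero.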
Level 3 — Implementation Details
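BLEU is sensitive to tokenization and other preprocessing choices, so the same system can get different scores under different setups. Researchers therefore usually report scores with the SacreBLEU library, which standardizes tokenization so that results from different papers are actually comparable.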
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'sleepy', 'dog']
score = sentence_bleu(reference, candidate)
print(f"BLEU score: {score}")
ROUGE Score
While BLEU is precision-oriented and was designed for translation, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is recall-oriented. It is primarily used to evaluate Text Summarization.
Level 1 — The Variants
There are three main types of ROUGE:
- ROUGE-1: Overlap of individual words.
- ROUGE-2: Overlap of word pairs (bigrams).
- ROUGE-L: Based on the Longest Common Subsequence (captures sentence structure better).
Level 2 — Precision vs Recall
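A summary needs to capture the "gist" of the original. ROUGE recall is the fraction of the reference summary's n-grams that also appear in the model's summary, so high recall suggests the model covered the important points. ROUGE precision is the fraction of the model's n-grams that appear in the reference, so high precision suggests the model didn't include unnecessary "fluff". The two are usually combined into an F-score.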
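To make the precision and recall trade-off concrete, here is a minimal sketch of ROUGE-N and ROUGE-L on whitespace-tokenized text. The function names and example sentences are illustrative assumptions; real ROUGE packages add stemming, F-scores, and multi-reference handling.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(hypothesis, reference, n=1):
    """Return (precision, recall) of n-gram overlap between two token lists."""
    hyp, ref = ngram_counts(hypothesis, n), ngram_counts(reference, n)
    overlap = sum((hyp & ref).values())  # multiset intersection of n-grams
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(hypothesis, reference):
    """Return (precision, recall) based on the longest common subsequence."""
    lcs = lcs_length(hypothesis, reference)
    precision = lcs / len(hypothesis) if hypothesis else 0.0
    recall = lcs / len(reference) if reference else 0.0
    return precision, recall


reference = "the cat sat on the mat".split()
terse = "the cat sat".split()
padded = "the cat sat on the mat in the old house today".split()

print("terse  ROUGE-1 (P, R):", rouge_n(terse, reference, 1))
print("padded ROUGE-1 (P, R):", rouge_n(padded, reference, 1))
print("terse  ROUGE-L (P, R):", rouge_l(terse, reference))
```

The terse summary has perfect precision but only half the recall, while the padded one recovers every reference word at the cost of precision.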
Level 3 — Use Cases
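ROUGE is a standard metric for evaluating LLMs and other models on summarization tasks. However, it only measures surface word overlap: recall can be inflated by long outputs that pack in many reference words, and valid paraphrases get no credit. It is therefore often used alongside BERTScore, which compares contextual embeddings for better semantic matching.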
# pip install rouge
from rouge import Rouge
rouge = Rouge()
hypothesis = "I like natural language processing."
reference = "I love natural language processing."
scores = rouge.get_scores(hypothesis, reference)
print(scores[0]['rouge-l'])
METEOR Score
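METEOR (Metric for Evaluation of Translation with Explicit ORdering) was created to fix the "rigidity" of BLEU. BLEU gives no credit for a word such as "happy" when the reference says "glad," because it only counts exact matches. METEOR also credits stems and synonyms.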
Level 1 — Flexible Matching
METEOR matches words in three stages:
- Exact: Identical strings.
- Stemming: Same root (walk vs walking).
- Synonymy: Same meaning (big vs large).
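The three stages above can be sketched as a greedy, one-to-one matcher. Everything here is a toy stand-in: `toy_stem` is a crude suffix stripper and `TOY_SYNONYMS` is a hand-written table, whereas real METEOR implementations rely on a proper stemmer and WordNet.

```python
def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hand-written synonym pairs; a stand-in for a WordNet lookup (assumption).
TOY_SYNONYMS = {frozenset({"big", "large"}), frozenset({"fast", "quick"})}

STAGES = (
    ("exact", lambda a, b: a == b),
    ("stem", lambda a, b: toy_stem(a) == toy_stem(b)),
    ("synonym", lambda a, b: frozenset({a, b}) in TOY_SYNONYMS),
)


def staged_matches(candidate, reference):
    """Map each candidate index to (reference index, stage that matched it)."""
    matches = {}
    used_refs = set()
    for stage_name, is_match in STAGES:
        for i, cand_word in enumerate(candidate):
            if i in matches:
                continue  # already matched in an earlier, stricter stage
            for j, ref_word in enumerate(reference):
                if j not in used_refs and is_match(cand_word, ref_word):
                    matches[i] = (j, stage_name)
                    used_refs.add(j)
                    break
    return matches


candidate = ["a", "big", "dog", "walking"]
reference = ["a", "large", "dog", "walked"]
matches = staged_matches(candidate, reference)
for i in sorted(matches):
    j, stage = matches[i]
    print(f"{candidate[i]!r} -> {reference[j]!r} ({stage})")
```

Stricter stages run first, so a word is only matched by stem or synonym when no exact match is left for it.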
Level 2 — Correlation with Humans
Because METEOR credits stems and synonyms and weights recall more heavily than precision, it tends to correlate better with human judgment than BLEU does, especially at the sentence level. If a human thinks a translation is good, METEOR is more likely to agree.
Level 3 — Advanced Alignment
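METEOR first builds a one-to-one alignment between the words of the machine output and the human reference. It then applies a fragmentation penalty: the matched words are grouped into "chunks" that are contiguous and in the same order in both sentences, and the fewer chunks needed, the smaller the penalty. This rewards correct word order, not only correct word choice.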
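As a rough sketch of how an alignment turns into a score, the snippet below takes an already computed alignment (pairs of candidate and reference word positions) and applies a recall-weighted harmonic mean and a chunk-based penalty. The constants (recall weighted 9 to 1, penalty of 0.5 times the cubed chunk ratio) are the values I understand the original METEOR formulation to use; later versions tune them, so treat them as assumptions. The function names are illustrative.

```python
def count_chunks(alignment):
    """Count runs of matches that are adjacent and in order in both sentences."""
    chunks = 0
    previous = None
    for cand_i, ref_j in sorted(alignment):
        starts_new_chunk = (
            previous is None
            or cand_i != previous[0] + 1
            or ref_j != previous[1] + 1
        )
        if starts_new_chunk:
            chunks += 1
        previous = (cand_i, ref_j)
    return chunks


def meteor_from_alignment(alignment, candidate_len, reference_len):
    """Score an alignment given as (candidate index, reference index) pairs."""
    matches = len(alignment)
    if matches == 0:
        return 0.0
    precision = matches / candidate_len
    recall = matches / reference_len
    # Harmonic mean that weights recall nine times more than precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # More chunks per match means more fragmented word order.
    penalty = 0.5 * (count_chunks(alignment) / matches) ** 3
    return f_mean * (1 - penalty)


# Same four words matched, first in order, then with the two halves swapped.
in_order = [(0, 0), (1, 1), (2, 2), (3, 3)]
swapped = [(0, 2), (1, 3), (2, 0), (3, 1)]
score_in_order = meteor_from_alignment(in_order, 4, 4)
score_swapped = meteor_from_alignment(swapped, 4, 4)
print(f"in order: {score_in_order:.4f}, swapped: {score_swapped:.4f}")
```

Both alignments match every word, so precision and recall are identical; only the number of chunks, and therefore the penalty, differs.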
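import nltk
from nltk.translate.meteor_score import meteor_score
# METEOR's synonym stage needs the WordNet corpus (some NLTK versions also need 'omw-1.4')
nltk.download('wordnet')
reference = "the quick brown fox"
candidate = "a fast brown fox"
# Recent NLTK versions expect pre-tokenized input: a list of reference token lists and one hypothesis token list
score = meteor_score([reference.split()], candidate.split())
print(f"METEOR score: {score}")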
CIDEr Score
CIDEr (Consensus-based Image Description Evaluation) is a metric designed specifically for Image Captioning tasks.
Level 1 — The Consensus Method
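Instead of matching a single reference, CIDEr compares the machine caption against a set of human captions for the same image (often five per image, and more in some datasets) and rewards n-grams that many of those captions share. It measures how well the machine caption agrees with what most humans say about the image.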
Level 2 — TF-IDF Weighting
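CIDEr uses TF-IDF weighting over n-grams. N-grams that appear in the reference captions of many images across the dataset (like "dog") are down-weighted, while rarer, more descriptive ones (like "Dalmatian") get more weight, so matching the rare, informative words earns more credit. The weighted n-gram vectors of the candidate and the references are then compared with cosine similarity.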
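The weighting idea can be sketched with a small inverse document frequency (IDF) calculation, where each image's set of reference captions counts as one "document". This is a simplification for illustration: it uses single words only, whereas CIDEr works on n-grams up to length four and then compares weighted vectors. The function name `idf_weights` and the captions are made up for the example.

```python
import math


def idf_weights(references_per_image):
    """IDF per word, treating each image's reference captions as one document."""
    num_images = len(references_per_image)
    doc_freq = {}
    for captions in references_per_image.values():
        words_for_image = {word for caption in captions for word in caption.split()}
        for word in words_for_image:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # A word seen for every image gets log(1) = 0; rarer words get larger weights.
    return {word: math.log(num_images / df) for word, df in doc_freq.items()}


# Illustrative captions only (not taken from a real dataset).
refs = {
    "img1": ["a dalmatian sits on grass", "a spotted dog on grass"],
    "img2": ["a dog catches a ball", "a brown dog jumps"],
    "img3": ["a dog sleeps on a sofa", "a small dog naps"],
}
weights = idf_weights(refs)
for word in ("dog", "on", "dalmatian"):
    print(f"{word:>10}: {weights[word]:.3f}")
```

Here "dog" appears for all three images and ends up with weight 0, "on" appears for two and gets a small weight, and "dalmatian" appears for one and gets the largest weight.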
Level 3 — Handling Noise
By using the consensus of many humans, CIDEr is robust to the "noisy" nature of human descriptions, where different people might focus on different details of a photo.
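# Assumes the pycocoevalcap package (pip install pycocoevalcap); the original names no library
from pycocoevalcap.cider.cider import Cider
# Both dicts map an image id to a list of lowercased, whitespace-tokenized captions
refs = {"img1": ["a dalmatian runs on grass", "a spotted dog runs outside"], "img2": ["a man rides a bike", "a cyclist on a road"]}
res = {"img1": ["a dalmatian runs outside"], "img2": ["a man on a bike"]}
# Higher CIDEr => better alignment with human consensus; the IDF weights need more than one image
score, per_image_scores = Cider().compute_score(refs, res)
print(f"CIDEr score: {score}")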