BLEU Score Tutorial

BLEU Score

BiLingual Evaluation Understudy: a standard metric for benchmarking machine translation.

BLEU (Bilingual Evaluation Understudy) is the most widely used automatic metric for measuring the quality of machine translation. It compares a machine-translated sentence against one or more human-written reference translations and scores their overlap.

Level 1 — The Intuition

BLEU essentially counts how many word sequences (n-grams) from the machine's output also appear in the reference translation. The more overlap, the higher the score. The counts are clipped: a candidate n-gram is credited at most as many times as it occurs in the reference, so repeating a matching word cannot inflate the score.

Problem: Brevity

If a machine only outputs one correct word ("The"), it could get 100% precision. BLEU therefore adds a Brevity Penalty that shrinks the score of any output shorter than the reference.
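To make this concrete, here is a minimal sketch of the standard brevity penalty: 1 when the candidate is at least as long as the reference, and exp(1 - r/c) otherwise, where r and c are the reference and candidate lengths.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Standard BLEU brevity penalty: 1.0 if the candidate is at least
    as long as the reference, otherwise exp(1 - r/c)."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

# A one-word candidate against a nine-word reference is punished hard:
print(brevity_penalty(1, 9))  # exp(-8), roughly 0.000335
print(brevity_penalty(9, 9))  # 1.0 -- no penalty at equal length
```

So even a "perfect precision" one-word output ends up with a near-zero score once the penalty is applied.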

Level 2 — N-gram Precision

BLEU usually looks at 1-grams, 2-grams, 3-grams, and 4-grams. It takes the geometric mean of these four precisions and multiplies it by the brevity penalty to get the final score, reported either on a 0-to-1 or a 0-to-100 scale.
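The whole pipeline — clipped n-gram precisions, geometric mean, brevity penalty — fits in a short from-scratch sketch (single-reference, sentence-level, uniform weights):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision: each candidate n-gram is credited at
    most as many times as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(reference, candidate, max_n=4):
    """Geometric mean of 1..max_n-gram precisions times the brevity penalty."""
    precisions = [modified_precision(reference, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:  # log(0) undefined; unsmoothed score is 0
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * math.exp(log_avg)

ref = "the quick brown fox jumped over the lazy dog".split()
cand = "the fast brown fox jumped over the sleepy dog".split()
print(f"{bleu(ref, cand):.4f}")  # 0.4855
```

Note that an unsmoothed sentence-level BLEU collapses to 0 whenever any n-gram order has no match, which is why libraries offer smoothing for short sentences.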

Level 3 — Implementation Details

Researchers typically report BLEU with the SacreBLEU library because it standardizes tokenization and reference handling, ensuring that scores from different papers are actually comparable.

BLEU Score with NLTK
from nltk.translate.bleu_score import sentence_bleu

# sentence_bleu takes a list of tokenized references and one tokenized candidate.
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'sleepy', 'dog']

# Default weights are (0.25, 0.25, 0.25, 0.25): uniform over 1- to 4-grams.
score = sentence_bleu(reference, candidate)
print(f"BLEU score: {score:.4f}")