ROUGE Score Tutorial

ROUGE Score

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation.

BLEU, which was designed for machine translation, is precision-oriented. ROUGE is recall-oriented: it asks how much of a reference text shows up in the generated text. It is primarily used to evaluate text summarization.

Level 1 — The Variants

There are three main types of ROUGE:

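  • ROUGE-1: Overlap of individual words (unigrams).
  • ROUGE-2: Overlap of adjacent word pairs (bigrams).
  • ROUGE-L: Based on the Longest Common Subsequence (LCS) of words. The matched words must appear in the same order but need not be adjacent, so it captures sentence structure better.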
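
To make the three variants concrete, here is a minimal from-scratch sketch of the quantities each one counts. It assumes lowercase whitespace tokenization (real ROUGE implementations typically add stemming and punctuation handling), and the helper names ngrams, ngram_overlap, and lcs_length are illustrative, not part of any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(candidate, reference, n):
    """Count n-grams shared by both texts, clipping repeats to the smaller count."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    return sum((cand & ref).values())

def lcs_length(candidate, reference):
    """Length of the longest common subsequence of words (the basis of ROUGE-L)."""
    a, b = candidate.lower().split(), reference.lower().split()
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

candidate = "the cat was found under the bed"
reference = "the cat was under the bed"

print(ngram_overlap(candidate, reference, 1))  # unigram matches (ROUGE-1): 6
print(ngram_overlap(candidate, reference, 2))  # bigram matches (ROUGE-2): 4
print(lcs_length(candidate, reference))        # LCS length (ROUGE-L): 6
```

The extra word "found" breaks one bigram, so the bigram count drops while the LCS still spans the whole reference. Dividing each count by the reference length (in n-grams or words) gives recall; dividing by the candidate length gives precision.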

Level 2 — Precision vs Recall

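A summary needs to capture the "gist" of the original. ROUGE measures this against a human-written reference summary. Recall is the fraction of the reference's words (or n-grams) that also appear in the generated summary, so high recall suggests the model covered the important points. Precision is the fraction of the generated summary's words that also appear in the reference, so high precision suggests the model didn't include unnecessary "fluff". The two are usually combined into an F1 score.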
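
The trade-off can be sketched for ROUGE-1 as follows. This is a simplified illustration that assumes lowercase whitespace tokens; rouge_1 is a hypothetical helper, not a library function.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Return ROUGE-1 precision, recall, and F1 using whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # matched words, clipped per word
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "the cat sat on the mat"
short = "the cat sat"
padded = "well the cat really sat on the mat today you know"

print(rouge_1(short, reference))   # precision 1.0, recall 0.5
print(rouge_1(padded, reference))  # precision 6/11, recall 1.0
```

The short summary contains no fluff but misses half of the reference; the padded one covers everything but buries it in filler. F1 penalizes both.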

Level 3 — Use Cases

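ROUGE is widely used to evaluate LLMs on summarization tasks. However, it only measures surface word overlap, so it can be "gamed" (for example, an output that piles in more words from the source tends to raise recall) and it gives no credit for valid paraphrases. For that reason it's often used alongside BERTScore for better semantic matching.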
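
A small sketch of why overlap-only scoring can mislead, again assuming lowercase whitespace tokens and a hypothetical unigram_recall helper: a scrambled word list can score perfectly while a faithful paraphrase scores poorly.

```python
from collections import Counter

def unigram_recall(candidate, reference):
    """ROUGE-1 recall: share of reference words also found in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    return sum((cand & ref).values()) / max(sum(ref.values()), 1)

reference = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"          # same meaning, different words
word_salad = "mat the on sat cat the dog bird fish"  # right words, no meaning

print(unigram_recall(paraphrase, reference))  # 1/6, only "the" matches
print(unigram_recall(word_salad, reference))  # 1.0, every reference word is present
```

An embedding-based metric such as BERTScore is designed to give the paraphrase credit, which is why the two are often reported together.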

ROUGE Score in Python
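# pip install rouge
from rouge import Rouge

rouge = Rouge()
# The generated (candidate) summary and the human-written reference.
hypothesis = "I like natural language processing."
reference = "I love natural language processing."

# get_scores returns one dict per hypothesis/reference pair,
# keyed by 'rouge-1', 'rouge-2', and 'rouge-l'.
scores = rouge.get_scores(hypothesis, reference)
# Each entry holds recall ('r'), precision ('p'), and F1 ('f').
print(scores[0]['rouge-l'])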