ROUGE Score
Recall-Oriented Understudy for Gisting Evaluation.
BLEU, the standard metric for machine translation, is precision-oriented; ROUGE is recall-oriented. It is used primarily to evaluate text summarization.
Level 1 — The Variants
There are three main types of ROUGE:
- ROUGE-1: Overlap of individual words.
- ROUGE-2: Overlap of word pairs (bigrams).
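- ROUGE-L: Based on the Longest Common Subsequence (LCS). It rewards words that appear in the same order in both texts, even when they are not adjacent, so it reflects sentence structure better than fixed n-grams.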
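The raw counts behind the three variants can be sketched in a few lines of plain Python. This is a minimal illustration, not the official ROUGE implementation: it assumes lowercased whitespace tokenization, and the helper names (ngrams, ngram_overlap, lcs_length) and the example sentences are made up for this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n):
    """Count n-grams shared by both texts, clipped to the smaller count."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    return sum((cand & ref).values())


def lcs_length(candidate, reference):
    """Length of the longest common subsequence of the two token lists."""
    a, b = candidate.lower().split(), reference.lower().split()
    # dp[i][j] holds the LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"

print(ngram_overlap(candidate, reference, 1))  # shared unigrams (ROUGE-1)
print(ngram_overlap(candidate, reference, 2))  # shared bigrams (ROUGE-2)
print(lcs_length(candidate, reference))        # in-order matches (ROUGE-L)
```

Dividing each count by the number of n-grams (or tokens) in the reference gives the recall form of the corresponding score.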
Level 2 — Precision vs Recall
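A summary needs to capture the "gist" of the original. ROUGE recall is the fraction of the reference summary's n-grams that also appear in the model's summary, so high recall means the model covered most of the important points. ROUGE precision is the fraction of the model's n-grams that also appear in the reference, so high precision means the model added little unnecessary "fluff". The two are usually combined into an F1 score.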
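To make the trade-off concrete, here is a small sketch of ROUGE-1 precision, recall, and F1. It assumes lowercased whitespace tokenization and clipped unigram counts; the function name rouge_1 and the example summaries are hypothetical, chosen only to show a terse summary next to a padded one.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return ROUGE-1 precision, recall, and F1 for a single text pair."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it occurs in both texts
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"p": precision, "r": recall, "f": f1}


reference = "the cat sat on the mat"
short = "the cat sat"
padded = "the cat sat on the mat in the old house by the river"

print(rouge_1(short, reference))   # perfect precision, but only half the reference is covered
print(rouge_1(padded, reference))  # perfect recall, but precision drops because of the extra words
```

The short summary contains nothing wrong but misses content, while the padded one covers everything at the cost of fluff; F1 penalizes both.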
Level 3 — Use Cases
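ROUGE is a standard metric for evaluating LLMs on summarization tasks. However, it measures only surface word overlap, so it can be "gamed" (for example, padding a summary with extra words raises recall) and it gives no credit for correct paraphrases. For that reason it is often used alongside BERTScore, which compares contextual embeddings to capture semantic similarity.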
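The limitation is easy to reproduce with a toy example. The sketch below assumes a simplified unigram recall with whitespace tokenization; the function name unigram_recall and the three sentences are invented for illustration.

```python
from collections import Counter


def unigram_recall(candidate, reference):
    """Fraction of reference words that also appear in the candidate (clipped)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    return overlap / max(sum(ref.values()), 1)


reference = "the film was excellent"
paraphrase = "a truly great movie"     # same meaning, no shared words
scrambled = "excellent the was film"   # same words, no coherent meaning

print(unigram_recall(paraphrase, reference))
print(unigram_recall(scrambled, reference))
```

The faithful paraphrase scores zero while the scrambled word salad scores perfectly, which is the gap an embedding-based metric is meant to close.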
ROUGE Score in Python
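# pip install rouge
from rouge import Rouge

rouge = Rouge()
hypothesis = "I like natural language processing."  # model output
reference = "I love natural language processing."   # human-written target

# get_scores returns one dict per pair, keyed by 'rouge-1', 'rouge-2', 'rouge-l';
# each entry holds recall ('r'), precision ('p'), and F1 ('f').
scores = rouge.get_scores(hypothesis, reference)
print(scores[0]['rouge-l'])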