Evaluation in NLP
Measuring the success of NLP systems with objective, quantitative metrics and benchmarks.
The Challenge of Language Evaluation
In traditional machine learning, we use metrics like accuracy or MSE. In NLP, evaluation is much harder: a single idea can be expressed in thousands of valid ways, so there is rarely one "correct" output to compare against.
Level 1 — Intrinsic vs Extrinsic Evaluation
Evaluation in NLP is generally divided into two categories:
Intrinsic
Measures the model on an isolated sub-task or proxy (e.g., POS-tagging accuracy, or perplexity for language models).
Extrinsic
Measures how much the model improves a downstream, real-world application (e.g., does it improve search results?).
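Perplexity, the intrinsic metric mentioned above, is the exponential of the average negative log-probability the model assigns to the true tokens; lower is better. A minimal sketch (the function name and the toy probabilities are illustrative, not from any real model):

```python
import math

def perplexity(token_probs):
    # Perplexity = exp of the mean negative log-probability per token.
    n = len(token_probs)
    nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(nll)

# A model that assigns probability 0.25 to each of 4 tokens is, on average,
# "hesitating among 4 equally likely options", so its perplexity is ~4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Intuitively, perplexity is the effective number of choices the model is weighing at each step, which is why a perfectly confident model (probability 1.0 everywhere) scores a perplexity of 1.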
Level 2 — Precision, Recall, and F1
For classification tasks (sentiment analysis, spam detection), we use the standard metrics:
- Precision: "Of everything I guessed as spam, how much was actually spam?"
- Recall: "Of all the spam that existed, how much did I catch?"
- F1-Score: the harmonic mean of the two, 2·P·R / (P + R), which is high only when both precision and recall are high.
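The three definitions above reduce to simple counts of true positives, false positives, and false negatives. A minimal sketch on a toy spam example (the helper function and labels are illustrative):

```python
def prf1(y_true, y_pred, positive=1):
    # Count the three outcomes that matter for the positive class.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)          # of everything flagged, how much was spam?
    recall = tp / (tp + fn)             # of all the spam, how much did we catch?
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Three real spam messages (label 1); we flagged two of them and no ham:
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(prf1(y_true, y_pred))  # precision 1.0, recall 2/3, F1 0.8
```

Note how F1 (0.8) sits between the perfect precision and the weaker recall, penalizing the imbalance.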
Level 3 — Modern Benchmarks (GLUE & SuperGLUE)
Today, researchers use large suites of tasks like GLUE (General Language Understanding Evaluation) and its harder successor SuperGLUE to measure a model's "general" language ability across many different types of language challenges.
Using scikit-learn, the Level 2 metrics can be computed in a single call:

from sklearn.metrics import classification_report

# 1 = Positive, 0 = Negative
y_true = [0, 1, 1, 0, 1]  # gold labels
y_pred = [0, 1, 0, 0, 1]  # model predictions (one positive example missed)
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))
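GLUE's headline leaderboard number is essentially a macro-average of per-task scores. A minimal sketch of that averaging, using made-up scores for four GLUE tasks (the numbers are illustrative, not real results; the metric named per task is the one GLUE actually uses):

```python
# Hypothetical per-task scores — illustrative only, not real leaderboard results.
task_scores = {
    "CoLA": 0.52,   # Matthews correlation
    "SST-2": 0.93,  # accuracy
    "MRPC": 0.88,   # F1
    "RTE": 0.71,    # accuracy
}

# GLUE-style headline number: unweighted average over tasks.
glue_score = sum(task_scores.values()) / len(task_scores)
print(f"GLUE (avg over {len(task_scores)} tasks): {glue_score:.3f}")
```

Because each task contributes equally, a model cannot climb the leaderboard by excelling at just one kind of language challenge.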