Evaluation Intro Tutorial

Evaluation in NLP

Measuring the success of NLP systems with objective, quantitative metrics.

The Challenge of Language Evaluation

In traditional machine learning, we rely on metrics like accuracy or mean squared error (MSE). In NLP, evaluation is much harder because a single concept can be expressed in thousands of equally valid ways.

Level 1 — Intrinsic vs Extrinsic Evaluation

Evaluation in NLP is generally divided into two categories:

Intrinsic

Measures the model on a specific sub-task (e.g., POS tagging accuracy or Perplexity).

Extrinsic

Measures how well the model helps a real-world application (e.g., does it improve Search results?).
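The intrinsic side of this split can be made concrete with perplexity, which scores a language model on held-out text. Below is a minimal sketch using made-up token probabilities from a hypothetical model (the values are illustrative, not from any real system):

```python
import math

# Hypothetical probabilities a language model assigned to each
# token of a held-out sentence (illustrative values only).
token_probs = [0.2, 0.1, 0.4, 0.25]

# Perplexity = exp of the average negative log-probability.
avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_neg_log_prob)

print(f"Perplexity: {perplexity:.2f}")  # lower is better
```

An extrinsic evaluation of the same model would instead plug it into a downstream system (e.g., a search ranker) and measure whether end-task quality improves.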

Level 2 — Precision, Recall, and F1

For classification (Sentiment, Spam), we use the standard metrics:

  • Precision: "Of everything I guessed as spam, how much was actually spam?"
  • Recall: "Of all the spam that existed, how much did I catch?"
  • F1-Score: The harmonic mean that balances both.
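The three questions above map directly onto counts of true positives, false positives, and false negatives. A small sketch computing them by hand for a toy spam task (the labels are hypothetical):

```python
# 1 = spam, 0 = not spam (hypothetical ground truth and predictions)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # spam we caught
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # spam we missed

precision = tp / (tp + fp)          # "of my spam guesses, how many were right?"
recall = tp / (tp + fn)             # "of all real spam, how much did I catch?"
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

The harmonic mean punishes imbalance: a model with perfect precision but near-zero recall still gets a near-zero F1.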

Level 3 — Modern benchmarks (GLUE & SuperGLUE)

Today, researchers evaluate models on large multi-task suites such as GLUE (General Language Understanding Evaluation) and its harder successor SuperGLUE, which aggregate scores across many different types of language challenges to approximate a model's general language understanding.

Scikit-Learn Evaluation
from sklearn.metrics import classification_report

# Ground-truth labels and model predictions (0 = Negative, 1 = Positive)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Prints per-class precision, recall, F1-score, and support
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))