Text Classification
Master text classification, from spam detection to news categorization, using classic Naive Bayes and modern zero-shot transformer models.
The Foundation of Categorization
Text Classification is the task of automatically assigning a predefined category label to a piece of text. It is one of the most commercially valuable NLP applications.
- **Spam Detection**: Email → Spam or Ham
- **News Categories**: Article → Sports / Tech / Politics
- **Sentiment**: Review → Positive / Negative
- **Language ID**: Text → English / French / Hindi
The Standard Pipeline
Raw Text → Preprocess → Vectorize → Train Model → Predict Label
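The first two stages can be sketched in a few lines, assuming a toy lowercasing tokenizer and a bag-of-words vectorizer (scikit-learn's `CountVectorizer` does the real version of this):

```python
from collections import Counter

def preprocess(text):
    # Toy normalization: lowercase and strip punctuation
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace())

def vectorize(text, vocabulary):
    # Bag-of-words: count how often each vocabulary word appears
    counts = Counter(preprocess(text).split())
    return [counts[w] for w in vocabulary]

vocab = ["free", "money", "meeting", "report"]
print(vectorize("Free money, FREE money!", vocab))  # [2, 2, 0, 0]
```

The resulting count vector is exactly the input a model like Naive Bayes trains on.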
Level 1 — Naive Bayes (Classic Baseline)
Naive Bayes applies Bayes' theorem with a "naive" assumption that words are conditionally independent given the label: it multiplies per-word probabilities to score how likely each label is to have generated the document. It is extremely fast to train and remains a strong baseline for spam detection.
How Naive Bayes Thinks
Email: "Free money prize!"
| Label | P("Free") | P("money") | P("prize") | Combined |
|---|---|---|---|---|
| SPAM | 0.35 | 0.40 | 0.45 | 0.063 |
| HAM | 0.02 | 0.05 | 0.01 | 0.00001 |
SPAM score is much higher → Email classified as SPAM!
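The arithmetic in the table can be reproduced directly (priors over SPAM/HAM are assumed equal here; real implementations sum log-probabilities instead of multiplying, to avoid underflow):

```python
# Per-word likelihoods from the table above
word_probs = {
    "SPAM": {"free": 0.35, "money": 0.40, "prize": 0.45},
    "HAM":  {"free": 0.02, "money": 0.05, "prize": 0.01},
}

def score(label, words):
    # Naive Bayes: multiply the per-word probabilities under this label
    p = 1.0
    for w in words:
        p *= word_probs[label][w]
    return p

words = ["free", "money", "prize"]
print(f"SPAM: {score('SPAM', words):.5f}")  # 0.06300
print(f"HAM:  {score('HAM', words):.5f}")   # 0.00001
```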
Python: Scikit-Learn Naive Bayes
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny labelled training set: three spam and three ham emails
train_texts = [
    "Win a free iPhone now!", "Click here for free money",
    "Limited time offer, claim your prize",
    "Meeting rescheduled to 3pm", "Please review the attached report",
    "Can we connect for a quick call tomorrow?",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# TF-IDF vectorizer feeding a multinomial Naive Bayes classifier
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])
model.fit(train_texts, train_labels)

test_texts = ["Get your free reward now!", "Let's schedule a team meeting"]
for text, label in zip(test_texts, model.predict(test_texts)):
    icon = "SPAM" if label == "spam" else "HAM"
    print(f"[{icon}] {text}")
```
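Beyond hard labels, the same pipeline exposes per-class probabilities via `predict_proba`, which is useful for thresholding borderline emails. A sketch reusing the same toy training set (exact numbers will vary with such little data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_texts = [
    "Win a free iPhone now!", "Click here for free money",
    "Limited time offer, claim your prize",
    "Meeting rescheduled to 3pm", "Please review the attached report",
    "Can we connect for a quick call tomorrow?",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])
model.fit(train_texts, train_labels)

for text in ["Claim your free prize", "Report attached for review"]:
    probs = model.predict_proba([text])[0]
    # model.classes_ gives the label order of the probability columns
    scores = "  ".join(f"{c}: {p:.2f}" for c, p in zip(model.classes_, probs))
    print(f"{scores}  ← {text}")
```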
Level 2 — Zero-Shot Classification (No Labels Needed!)
Zero-shot classification lets you categorize text into ANY labels you describe in plain English — without a single labelled training example! Under the hood, a transformer fine-tuned on natural language inference (here, facebook/bart-large-mnli) checks whether the text entails a hypothesis like "This example is about Technology."
Python: Hugging Face Zero-Shot
```python
from transformers import pipeline

# BART model fine-tuned on MNLI, repurposed for zero-shot classification
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

text = "Apple announced record quarterly earnings driven by iPhone sales."
labels = ["Technology", "Finance", "Sports", "Politics"]

result = classifier(text, labels)

print("Label Scores:")
for label, score in zip(result["labels"], result["scores"]):
    bar = "█" * int(score * 30)  # simple text bar chart
    print(f"  {label:<12} {score:.1%} {bar}")
```
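Under the hood, the pipeline runs one NLI forward pass per candidate label and, in single-label mode, softmaxes the entailment logits across labels so the scores sum to 1. That normalization step looks like this (the logits below are made up for illustration, not real model output):

```python
import math

# Hypothetical entailment logits, one per candidate label
logits = {"Technology": 3.1, "Finance": 2.4, "Sports": -1.0, "Politics": -0.5}

def softmax(values):
    # Subtract the max for numerical stability before exponentiating
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = softmax(list(logits.values()))
for label, s in sorted(zip(logits, scores), key=lambda x: -x[1]):
    print(f"{label:<12} {s:.1%}")
```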