Text Classification & Sentiment
Text categorization and sentiment analysis for opinion mining.
Text Classification
The Foundation of Categorization
Text Classification is the task of automatically assigning a predefined category label to a piece of text. It is one of the most commercially valuable NLP applications.
Spam Detection
Email → Spam or Ham
News Categories
Article → Sports / Tech / Politics
Sentiment
Review → Positive / Negative
Language ID
Text → English / French / Hindi
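The Standard Pipeline
Raw text → Preprocessing (tokenize, remove stop words) → Vectorization (e.g., TF-IDF) → Classifier → Predicted label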
Level 1 — Naive Bayes (Classic Baseline)
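Naive Bayes estimates, for each label, how likely the document's words are under that label, making the "naive" assumption that words occur independently of one another. It multiplies the per-word probabilities (together with the label's prior probability) and picks the label with the highest score. It is very fast to train and is a common baseline for spam detection.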
How Naive Bayes Thinks
Email: "Free money prize win!"
| Label | P("Free") | P("money") | P("prize") | Combined (product) |
|---|---|---|---|---|
| SPAM | 0.35 | 0.40 | 0.45 | 0.063 |
| HAM | 0.02 | 0.05 | 0.01 | 0.00001 |
SPAM score is much higher → Email classified as SPAM!
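The "Combined" column is simply the product of the per-word probabilities in each row. A minimal sketch of that arithmetic in plain Python, using the illustrative numbers from the table and assuming equal class priors (the table does not state priors, and it leaves out the word "win"). The log-space score is a common variant that avoids numeric underflow on longer documents:

```python
import math

# Per-word likelihoods P(word | label), copied from the table above.
likelihoods = {
    "SPAM": {"free": 0.35, "money": 0.40, "prize": 0.45},
    "HAM": {"free": 0.02, "money": 0.05, "prize": 0.01},
}
# Assumption: equal priors, which the table does not state.
priors = {"SPAM": 0.5, "HAM": 0.5}

words = ["free", "money", "prize"]

def combined_likelihood(label):
    """Product of per-word likelihoods (the table's 'Combined' column)."""
    return math.prod(likelihoods[label][w] for w in words)

def log_score(label):
    """Log prior plus summed log likelihoods: same ranking, no underflow."""
    return math.log(priors[label]) + sum(
        math.log(likelihoods[label][w]) for w in words
    )

combined = {label: combined_likelihood(label) for label in likelihoods}
predicted = max(likelihoods, key=log_score)

for label, value in combined.items():
    print(f"{label}: combined={value:.5f}  log_score={log_score(label):.2f}")
print("Predicted:", predicted)
```

With equal priors the ranking depends only on the word likelihoods. A strongly skewed prior (for example, very little spam in the training data) could shift the decision.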
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny toy dataset: three spam and three ham messages.
train_texts = [
    "Win a free iPhone now!", "Click here for free money",
    "Limited time offer, claim your prize",
    "Meeting rescheduled to 3pm", "Please review the attached report",
    "Can we connect for a quick call tomorrow?",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# TF-IDF turns each text into weighted word features; MultinomialNB classifies them.
model = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', MultinomialNB()),
])
model.fit(train_texts, train_labels)

test_texts = ["Get your free reward now!", "Let's schedule a team meeting"]
for text, label in zip(test_texts, model.predict(test_texts)):
    tag = "SPAM" if label == "spam" else "HAM"
    print(f"[{tag}] {text}")
Level 2 — Zero-Shot BART (No Training Labels Needed!)
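from transformers import pipeline

# Loads an NLI-trained BART model; the first run downloads the weights.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

text = "Apple announced record quarterly earnings driven by iPhone sales."
# Candidate labels are supplied at inference time; no labeled training data is used.
labels = ["Technology", "Finance", "Sports", "Politics"]
result = classifier(text, labels)

# result['labels'] and result['scores'] come back sorted from most to least likely.
print("Label Scores:")
for label, score in zip(result['labels'], result['scores']):
    bar = "█" * int(score * 30)
    print(f"  {label:<12} {score:.1%} {bar}")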
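Zero-shot classification with an NLI model such as facebook/bart-large-mnli works by turning each candidate label into a hypothesis and asking how strongly the text entails it. The sketch below mimics that framing in plain Python without loading a model. The template string is believed to match the pipeline's default hypothesis template, and the entailment logits are made-up numbers for illustration only, not real model output:

```python
import math

text = "Apple announced record quarterly earnings driven by iPhone sales."
labels = ["Technology", "Finance", "Sports", "Politics"]

# Assumed template: believed to match the pipeline's default hypothesis_template.
template = "This example is {}."

# Each candidate label becomes a (premise, hypothesis) pair for the NLI model.
pairs = [(text, template.format(label)) for label in labels]

# Hypothetical entailment logits, one per pair. A real model would produce these.
entailment_logits = [2.1, 2.6, -3.0, -2.2]

def softmax(values):
    """Normalize logits into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# In single-label mode, scores are normalized across the candidate labels.
scores = softmax(entailment_logits)
ranked = sorted(zip(labels, scores), key=lambda item: item[1], reverse=True)

for (premise, hypothesis), logit in zip(pairs, entailment_logits):
    print(f"{hypothesis!r}: entailment logit {logit:+.1f}")
for label, score in ranked:
    print(f"{label:<12} {score:.1%}")
```

Because the labels are only plugged into a sentence template, wording matters: "Finance" and "Business news" can produce noticeably different scores for the same text.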
Sentiment Analysis
Understanding Emotions in Text
Sentiment Analysis (Opinion Mining) automatically detects the emotional tone of a piece of text, usually as a polarity (positive, neutral, or negative) or a numeric score. It is widely used for brand monitoring, product review analysis, and social media analytics.
Positive
"This product is absolutely amazing! Best purchase ever."
Score: +0.92
Neutral
"The product arrived on Tuesday. It is a standard device."
Score: +0.05
Negative
"Terrible quality. Broke after two days. Complete waste of money."
Score: -0.88
Level 1 — VADER: Rule-Based Lexicon
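VADER (Valence Aware Dictionary and sEntiment Reasoner) combines a hand-crafted sentiment lexicon with a small set of rules, so it requires zero training data. Its rules account for cues such as emoticons (and, in some implementations, emojis), ALL CAPS emphasis, exclamation marks, and negation.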
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the lexicon file

sia = SentimentIntensityAnalyzer()
sentences = [
    "I LOVE this phone! Absolutely fantastic!",
    "The battery life is not bad, but the camera is disappointing.",
    "Worst product ever. Complete garbage. DO NOT BUY!",
    "It's okay I guess. Nothing special.",
]
for sent in sentences:
    scores = sia.polarity_scores(sent)
    compound = scores['compound']  # normalized overall score in [-1, 1]
    # Conventional VADER cutoffs: >= 0.05 is positive, <= -0.05 is negative.
    label = "POSITIVE" if compound >= 0.05 else ("NEGATIVE" if compound <= -0.05 else "NEUTRAL")
    print(f"[{label:>8}] ({compound:+.3f}) {sent[:50]}")
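To make the idea of a rule-based lexicon concrete, here is a toy scorer in plain Python. It is not VADER's actual algorithm or lexicon: the word valences, the ALL CAPS multiplier, and the one-word negation flip are all invented for illustration. It only shows how a few hand-written rules can adjust dictionary scores:

```python
import re

# Hypothetical mini-lexicon: word -> valence. Real lexicons hold thousands of entries.
LEXICON = {
    "love": 3.0, "fantastic": 3.0, "bad": -2.5,
    "disappointing": -2.0, "worst": -3.0, "garbage": -2.5,
}
NEGATIONS = {"not", "never", "no"}
CAPS_BOOST = 1.5       # assumed multiplier for ALL-CAPS words
NEGATION_FLIP = -0.5   # assumed: negation flips and dampens the next word

def toy_score(sentence):
    """Sum word valences, applying a caps boost and a one-word negation flip."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    total = 0.0
    for i, token in enumerate(tokens):
        valence = LEXICON.get(token.lower(), 0.0)
        if valence == 0.0:
            continue  # word carries no sentiment in this lexicon
        if token.isupper() and len(token) > 1:
            valence *= CAPS_BOOST
        if i > 0 and tokens[i - 1].lower() in NEGATIONS:
            valence *= NEGATION_FLIP
        total += valence
    return total

for s in ["I love this phone", "I LOVE this phone", "The battery is not bad"]:
    print(f"{toy_score(s):+.2f}  {s}")
```

A plain word count would score "not bad" as negative. The negation rule is what turns it mildly positive, which is the kind of adjustment a rule-based analyzer makes without any training.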
Level 2 — Aspect-Based Sentiment Analysis (ABSA)
Instead of one overall score, ABSA identifies specific aspects and assigns individual sentiments to each.
Restaurant Review Example
"The food was incredibly delicious, but the service was painfully slow and the price is way too expensive."
| Aspect | Sentiment | Keywords |
|---|---|---|
| Food | POSITIVE (+0.91) | delicious |
| Service | NEGATIVE (-0.78) | slow |
| Price | NEGATIVE (-0.65) | expensive |
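A very rough sketch of the idea, using keyword matching rather than a trained model. The aspect terms, opinion words, and clause-splitting rule are assumptions chosen to fit this one review, and the sketch produces only a polarity, not the numeric scores shown in the table. Real ABSA systems typically learn aspect terms and their sentiment from data:

```python
import re

# Hypothetical keyword lists chosen to fit the example review.
ASPECT_TERMS = {"food": "Food", "service": "Service", "price": "Price"}
OPINION_WORDS = {
    "delicious": "POSITIVE",
    "slow": "NEGATIVE",
    "expensive": "NEGATIVE",
}

def toy_absa(review):
    """Split the review into clauses and pair each aspect with the opinion word in its clause."""
    clauses = re.split(r",|\bbut\b|\band\b", review.lower())
    results = {}
    for clause in clauses:
        words = re.findall(r"[a-z]+", clause)
        aspect = next((ASPECT_TERMS[w] for w in words if w in ASPECT_TERMS), None)
        opinion = next((w for w in words if w in OPINION_WORDS), None)
        if aspect and opinion:
            results[aspect] = (OPINION_WORDS[opinion], opinion)
    return results

review = ("The food was incredibly delicious, but the service was painfully slow "
          "and the price is way too expensive.")
found = toy_absa(review)
for aspect, (sentiment, keyword) in found.items():
    print(f"{aspect:<8} {sentiment:<9} {keyword}")
```

The key step is scoping each opinion word to the clause that mentions the aspect, so "delicious" attaches to the food and not to the service or price.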
Level 3 — Transformer Sentiment (Best Accuracy)
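from transformers import pipeline

# DistilBERT fine-tuned on SST-2, a binary task: it outputs only POSITIVE or NEGATIVE.
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

reviews = [
    "Absolutely love it! Works perfectly and arrived fast.",
    "Not great, not terrible. Does the job I suppose.",
    "Horrible. Completely broke after first use. Save your money.",
]
for review in reviews:
    result = sentiment(review)[0]  # one dict per input: {'label': ..., 'score': ...}
    # Mixed reviews (like the second one) are still forced into one of the two classes.
    preview = review if len(review) <= 55 else review[:55] + "..."
    print(f"[{result['label']}] ({result['score']:.1%}) {preview}")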
Key Challenges
| Challenge | Example | Problem |
|---|---|---|
| Sarcasm | "Oh great, another Monday" | Positive words, negative intent. |
| Negation | "Not bad at all" | Grammatically negative, semantically positive. |
| Context | "The movie was predictable" | Good for some genres, bad for others. |
| Domain Shift | "This medicine is sick" | "Sick" means ill in a medical context but "amazing" in slang, so a model trained on one domain can misread the other. |