NLP Tutorial

Text Classification & Sentiment

Text categorization and sentiment analysis for opinion mining.

Text Classification

The Foundation of Categorization

Text classification is the task of automatically assigning one of a set of predefined category labels to a piece of text. It underpins many widely deployed NLP products, such as the four examples below.

Spam Detection

Email → Spam or Ham

News Categories

Article → Sports / Tech / Politics

Sentiment

Review → Positive / Negative

Language ID

Text → English / French / Hindi

The Standard Pipeline

Raw Text → Preprocess → Vectorize → Train Model → Predict Label
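
The first three stages can be sketched in plain Python with no external library. This is only a minimal illustration, assuming lowercase alphabetic tokens for preprocessing and a bag-of-words count vector for vectorization; the helper names `preprocess`, `build_vocab`, and `vectorize` are hypothetical and do not come from any library.

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(docs):
    """Map every distinct token in the corpus to a column index."""
    tokens = sorted({tok for doc in docs for tok in preprocess(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(text, vocab):
    """Turn one document into a bag-of-words count vector."""
    counts = Counter(preprocess(text))
    return [counts.get(tok, 0) for tok in vocab]

docs = ["Free money now", "Meeting at noon", "Free prize, free money"]
vocab = build_vocab(docs)
vectors = [vectorize(d, vocab) for d in docs]

for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Each vector has one column per vocabulary word, which is the numeric form a classifier can be trained on in the next stage.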

Level 1 — Naive Bayes (Classic Baseline)

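Naive Bayes estimates, for each label, how likely that label is to have produced the words in the document, treating every word as independent of the others (the "naive" assumption). It then picks the label with the highest score. It is extremely fast to train and works well as a baseline for tasks such as spam detection.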

How Naive Bayes Thinks

Email: "Free money prize win!"

Label   P("Free")   P("money")   P("prize")   Combined
SPAM    0.35        0.40         0.45         0.063
HAM     0.02        0.05         0.01         0.00001

SPAM score is much higher → Email classified as SPAM!
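
The arithmetic behind the Combined column can be reproduced in a few lines. This is a sketch that uses only the three word probabilities from the table and assumes equal class priors, which the table does not state; the function names are illustrative.

```python
import math

# Per-word likelihoods copied from the table above.
likelihoods = {
    "SPAM": {"free": 0.35, "money": 0.40, "prize": 0.45},
    "HAM":  {"free": 0.02, "money": 0.05, "prize": 0.01},
}
words = ["free", "money", "prize"]

def combined_score(label):
    """Multiply the per-word likelihoods (the 'naive' independence step)."""
    return math.prod(likelihoods[label][w] for w in words)

def log_score(label):
    """Same score in log space, which avoids underflow on long documents."""
    return sum(math.log(likelihoods[label][w]) for w in words)

scores = {label: combined_score(label) for label in likelihoods}
prediction = max(scores, key=scores.get)

# Normalise so the two scores sum to 1 (equal class priors assumed).
posterior_spam = scores["SPAM"] / sum(scores.values())

print(scores, prediction, round(posterior_spam, 4))
```

Real implementations usually work with the log form, because multiplying hundreds of small probabilities quickly rounds to zero in floating point.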

Python: Scikit-Learn Naive Bayes
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny toy dataset: three spam and three ham (legitimate) messages.
train_texts = [
    "Win a free iPhone now!", "Click here for free money",
    "Limited time offer, claim your prize",
    "Meeting rescheduled to 3pm", "Please review the attached report",
    "Can we connect for a quick call tomorrow?"
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Chain vectorization and classification: raw strings in, labels out.
model = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),  # text -> TF-IDF weights
    ('clf', MultinomialNB())                           # Naive Bayes classifier
])
model.fit(train_texts, train_labels)

test_texts = ["Get your free reward now!", "Let's schedule a team meeting"]
for text, label in zip(test_texts, model.predict(test_texts)):
    print(f"[{label.upper()}] {text}")

Level 2 — Zero-Shot Classification (No Labels Needed!)

Zero-shot classification lets you sort text into any labels you describe in plain English, without any labelled training examples. The example uses facebook/bart-large-mnli, a BART model fine-tuned on the MNLI natural language inference dataset.

Python: Hugging Face Zero-Shot
from transformers import pipeline

# The model is downloaded on first use; it is a large checkpoint.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

text = "Apple announced record quarterly earnings driven by iPhone sales."
labels = ["Technology", "Finance", "Sports", "Politics"]  # any labels you like
result = classifier(text, labels)

# result['labels'] is sorted from highest to lowest score.
print("Label Scores:")
for label, score in zip(result['labels'], result['scores']):
    bar = "█" * int(score * 30)  # simple text bar chart
    print(f"  {label:<12} {score:.1%}  {bar}")
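
Under the hood, this kind of pipeline recasts classification as natural language inference: each candidate label becomes a hypothesis sentence, and the model scores how strongly the text entails it. The sketch below shows only that bookkeeping. The entailment logits are invented numbers, not real model output, and the template "This example is {}." is, to my knowledge, the pipeline's default `hypothesis_template`; treat both as assumptions.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Technology", "Finance", "Sports", "Politics"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits, one per (text, hypothesis) pair.
# A real NLI model would produce these; the values here are invented.
entailment_logits = [2.1, 3.0, -1.5, -0.8]

scores = softmax(entailment_logits)
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)

for label, score in ranked:
    print(f"{label:<12} {score:.1%}")
```

Because the labels are just text plugged into a sentence, changing the label list requires no retraining, which is what makes the approach "zero-shot".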

Sentiment Analysis

Understanding Emotions in Text

Sentiment Analysis (also called Opinion Mining) automatically detects the emotional tone of a text, typically as positive, negative, or neutral. It powers brand monitoring, product-review analysis, and social media analytics.

😊
Positive

"This product is absolutely amazing! Best purchase ever."
Score: +0.92

😐
Neutral

"The product arrived on Tuesday. It is a standard device."
Score: +0.05

😡
Negative

"Terrible quality. Broke after two days. Complete waste of money."
Score: -0.88

Level 1 — VADER: Rule-Based Lexicon

VADER (Valence Aware Dictionary and sEntiment Reasoner) pairs a hand-crafted sentiment lexicon with a small set of rules, so it needs no training data. The rules account for emoticons, ALL CAPS emphasis, and negation.

Best for: social media posts, tweets, and short reviews, or anything else informal.

Python: NLTK VADER
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)  # one-time lexicon download

sia = SentimentIntensityAnalyzer()

sentences = [
    "I LOVE this phone! Absolutely fantastic!",
    "The battery life is not bad, but the camera is disappointing.",
    "Worst product ever. Complete garbage. DO NOT BUY!",
    "It's okay I guess. Nothing special.",
]

for sent in sentences:
    scores = sia.polarity_scores(sent)  # keys: neg, neu, pos, compound
    compound = scores['compound']       # overall score between -1 and +1
    # Conventional cut-offs: >= 0.05 positive, <= -0.05 negative, else neutral.
    label = "POSITIVE" if compound >= 0.05 else ("NEGATIVE" if compound <= -0.05 else "NEUTRAL")
    print(f"[{label:>8}] ({compound:+.3f}) {sent[:50]}")
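
To give a feel for what "lexicon plus rules" means, here is a toy scorer with a four-word lexicon and two rules: an ALL CAPS boost and a one-word negation flip. It is not VADER's actual lexicon or algorithm; the word valences, the two multipliers, and the name `toy_score` are all assumptions chosen for illustration.

```python
import re

# Hypothetical word valences; VADER's real lexicon has thousands of entries.
LEXICON = {"love": 3.0, "fantastic": 3.0, "bad": -2.5, "garbage": -3.0}
NEGATIONS = {"not", "never", "no"}
CAPS_BOOST = 1.5      # assumed multiplier for ALL-CAPS words
NEGATION_FLIP = -0.5  # assumed: negation reverses and dampens the valence

def toy_score(sentence):
    """Sum word valences, applying a caps boost and a one-word negation flip."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    total = 0.0
    for i, token in enumerate(tokens):
        valence = LEXICON.get(token.lower(), 0.0)
        if valence == 0.0:
            continue  # word carries no sentiment in this lexicon
        if token.isupper() and len(token) > 1:
            valence *= CAPS_BOOST
        if i > 0 and tokens[i - 1].lower() in NEGATIONS:
            valence *= NEGATION_FLIP
        total += valence
    return total

for s in ["I love this", "I LOVE this", "not bad", "Complete garbage"]:
    print(f"{toy_score(s):+.2f}  {s}")
```

Even this small rule set gives "not bad" a mildly positive score, where a plain word count would score it as negative.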

Level 2 — Aspect-Based Sentiment Analysis (ABSA)

Instead of one overall score, ABSA identifies specific aspects and assigns individual sentiments to each.

Restaurant Review Example

"The food was incredibly delicious, but the service was painfully slow and the price is way too expensive."

Aspect    Sentiment          Keywords
Food      POSITIVE (+0.91)   delicious
Service   NEGATIVE (-0.78)   slow
Price     NEGATIVE (-0.65)   expensive
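
One simple way to approximate this is to split the review into clauses, find which aspect each clause mentions, and score the opinion words in that clause. The sketch below does exactly that. It is not the method behind the scores in the table, it produces labels only, and the aspect keywords and opinion valences are hypothetical.

```python
import re

# Hypothetical aspect keywords and opinion-word valences for illustration.
ASPECTS = {
    "Food": {"food", "meal"},
    "Service": {"service", "staff"},
    "Price": {"price", "cost"},
}
OPINIONS = {"delicious": 1.0, "slow": -1.0, "expensive": -1.0}

def aspect_sentiments(review):
    """Return {aspect: label} by scoring opinion words clause by clause."""
    results = {}
    # Split on commas and the conjunctions "but" / "and".
    clauses = re.split(r",|\bbut\b|\band\b", review.lower())
    for clause in clauses:
        words = set(re.findall(r"[a-z]+", clause))
        score = sum(OPINIONS.get(w, 0.0) for w in words)
        for aspect, keywords in ASPECTS.items():
            if words & keywords and score != 0:
                results[aspect] = "POSITIVE" if score > 0 else "NEGATIVE"
    return results

review = ("The food was incredibly delicious, but the service was painfully "
          "slow and the price is way too expensive.")
result = aspect_sentiments(review)
print(result)
```

Clause splitting like this breaks down on longer or less tidy sentences, which is why practical ABSA systems generally rely on trained models instead.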

Level 3 — Transformer Sentiment (Best Accuracy)

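A fine-tuned transformer usually gives the highest accuracy of the three approaches, at the cost of a larger model download and slower inference.

Python: Hugging Face Transformers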
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

reviews = [
    "Absolutely love it! Works perfectly and arrived fast.",
    "Not great, not terrible. Does the job I suppose.",
    "Horrible. Completely broke after first use. Save your money.",
]
for review in reviews:
    # The pipeline returns one dict per input: {'label': ..., 'score': ...}
    result = sentiment(review)[0]
    print(f"[{result['label']}] ({result['score']:.1%}) {review}")
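
The checkpoint used here was fine-tuned on SST-2, a two-class dataset, so it only ever answers POSITIVE or NEGATIVE, even for a lukewarm review like the second one. One common workaround is to treat low-confidence predictions as neutral. The sketch below assumes the `{'label', 'score'}` output shape shown above; the function name `to_three_way` and the 0.75 threshold are arbitrary choices that would need tuning on real data.

```python
def to_three_way(prediction, threshold=0.75):
    """Map a binary {'label', 'score'} prediction to a three-way label.

    Predictions whose confidence falls below `threshold` become NEUTRAL.
    """
    if prediction["score"] < threshold:
        return "NEUTRAL"
    return prediction["label"]

# Hand-written dictionaries in the pipeline's output shape, not real model output.
examples = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.61},
    {"label": "NEGATIVE", "score": 0.98},
]
mapped = [to_three_way(p) for p in examples]
print(mapped)
```

Binary classifiers are often confidently wrong on neutral text, so a model trained with an explicit neutral class is generally the more reliable option when one is available.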

Key Challenges

Challenge      Example                       Problem
Sarcasm        "Oh great, another Monday"    Positive words, negative intent.
Negation       "Not bad at all"              Grammatically negative, semantically positive.
Context        "The movie was predictable"   Good for some genres, bad for others.
Domain Shift   "This medicine is sick"       "Sick" means ill in a medical context but "amazing" in slang.