Naive Bayes Classifier
Naive Bayes is a simple yet powerful probabilistic classifier that works well for high-dimensional data such as text.
Bayes' Theorem
Naive Bayes is based on Bayes' theorem:
\[ P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)} \]
The "naive" assumption is that features are conditionally independent given the class label.
Naive Bayes with scikit-learn
Text classification using MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# texts: list of raw documents; y: list of class labels
# Split the raw documents first so the vectorizer never sees the test set
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, random_state=42, stratify=y
)

# Learn the vocabulary from the training documents only, then reuse it
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)

nb = MultinomialNB()  # alpha=1.0 (Laplace smoothing) by default
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)
print(classification_report(y_test, y_pred))
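New documents are classified by transforming them with the already-fitted vectorizer, for example:

nb.predict(vectorizer.transform(["an example document to classify"]))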
Types of Naive Bayes
- GaussianNB: for continuous features assumed to follow a normal distribution.
- MultinomialNB: for count data such as word frequencies in text.
- BernoulliNB: for binary features (e.g., word present / absent); a short sketch follows the GaussianNB example below.
GaussianNB is used the same way; note that it expects a dense matrix of continuous features rather than sparse counts:

from sklearn.naive_bayes import GaussianNB

# X_train_dense / X_test_dense: dense matrices of continuous features
# (e.g., numeric measurements, not the sparse word counts above)
gnb = GaussianNB()
gnb.fit(X_train_dense, y_train)
y_pred = gnb.predict(X_test_dense)
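The Bernoulli variant follows the same pattern. A minimal sketch, assuming the texts_train / texts_test / y_train split from the text example above and using word presence rather than counts:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# binary=True records word presence/absence instead of counts
binary_vectorizer = CountVectorizer(binary=True)
X_train_bin = binary_vectorizer.fit_transform(texts_train)
X_test_bin = binary_vectorizer.transform(texts_test)

bnb = BernoulliNB()
bnb.fit(X_train_bin, y_train)
y_pred_bin = bnb.predict(X_test_bin)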
Strengths & Weaknesses
- Pros: extremely fast to train and predict, works well with high-dimensional sparse features, simple to implement.
- Cons: the conditional-independence assumption is often violated in practice, and probability estimates can be poorly calibrated (a calibration sketch follows below).

Despite its simplicity, Naive Bayes remains a strong baseline for many NLP tasks such as spam filtering and sentiment classification.
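If calibrated probabilities matter, one common remedy (a sketch, not part of the original text) is to wrap the model in scikit-learn's CalibratedClassifierCV, here assuming the X_train / y_train count matrices from the MultinomialNB example:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import MultinomialNB

# sigmoid (Platt) calibration fitted with 5-fold cross-validation
calibrated_nb = CalibratedClassifierCV(MultinomialNB(), method="sigmoid", cv=5)
calibrated_nb.fit(X_train, y_train)
probabilities = calibrated_nb.predict_proba(X_test)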