
Naive Bayes Classifier

Naive Bayes is a simple yet powerful probabilistic classifier that works well for high-dimensional data such as text.

Bayes' Theorem

Naive Bayes is based on Bayes' theorem:

\[ P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)} \]

The "naive" assumption is that features are conditionally independent given the class label.
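Under this assumption the likelihood factorizes over the individual features, so for a feature vector \( x = (x_1, \ldots, x_n) \):

\[ P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) \]

Since \( P(x) \) is the same for every class, prediction reduces to choosing the class \( y \) that maximizes this product.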

Naive Bayes with scikit-learn

Text classification using MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Split the raw documents first, then fit the vectorizer on the
# training texts only, to avoid leaking test-set vocabulary
# statistics into the model.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, random_state=42, stratify=y
)  # texts: list of documents, y: class labels

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)

nb = MultinomialNB()
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)
print(classification_report(y_test, y_pred))
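Once fitted, the same vectorizer must be used to transform any new document before prediction. A minimal, self-contained sketch with a hypothetical toy corpus (the texts and labels below are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = sports, 0 = tech.
texts = [
    "the team won the match",
    "new phone released today",
    "great goal in the second half",
    "software update improves battery",
]
y = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

nb = MultinomialNB().fit(X, y)

# Classify an unseen document with the *same* fitted vectorizer;
# words outside the training vocabulary are simply ignored.
new_doc = vectorizer.transform(["the match was a great win"])
print(nb.predict(new_doc))  # → [1]
```

Note that `transform` (not `fit_transform`) is called on the new document, so the vocabulary learned at training time stays fixed.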

Types of Naive Bayes

  • GaussianNB: for continuous features assumed to follow a normal distribution.
  • MultinomialNB: for count data such as word frequencies in text.
  • BernoulliNB: for binary features (e.g. word present / absent).
from sklearn.naive_bayes import GaussianNB

# GaussianNB is intended for continuous features and requires a
# dense array; sparse count matrices must be densified first.
gnb = GaussianNB()
gnb.fit(X_train.toarray(), y_train)
y_pred = gnb.predict(X_test.toarray())
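BernoulliNB pairs naturally with presence/absence features, which `CountVectorizer(binary=True)` produces directly. A minimal sketch with a hypothetical spam/ham toy corpus (data invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: 1 = spam, 0 = ham.
texts = [
    "win money now",
    "meeting at noon",
    "win a free prize now",
    "lunch meeting tomorrow",
]
y = [1, 0, 1, 0]

# binary=True records word presence/absence instead of counts,
# matching BernoulliNB's Bernoulli feature model.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(texts)

bnb = BernoulliNB().fit(X, y)
print(bnb.predict(vectorizer.transform(["free money prize"])))  # → [1]
```

Unlike MultinomialNB, BernoulliNB also penalizes the *absence* of words that are typical for a class, which can matter for short documents.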

Strengths & Weaknesses

  • Pros: extremely fast, works well with high‑dimensional sparse features, simple to implement.
  • Cons: independence assumption is often violated; probability estimates can be poorly calibrated.
  • Despite its simplicity, Naive Bayes is a strong baseline for many NLP tasks.
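As a baseline, the vectorizer and classifier are often bundled into a single `Pipeline` so the model fits and predicts directly on raw text. A minimal sketch with hypothetical toy data (texts and labels invented for illustration):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy sentiment data.
texts = ["good movie", "bad film", "great movie", "terrible film"]
y = ["pos", "neg", "pos", "neg"]

# The pipeline applies vectorization and classification in one step,
# which also prevents accidental train/test leakage.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, y)

print(model.predict(["great movie tonight"]))  # → ['pos']
```

The same pipeline object can be dropped into `cross_val_score` or `GridSearchCV`, which is part of what makes Naive Bayes such a convenient baseline.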