Bag of Words (BoW) Tutorial Section

Bag of Words (BoW)

Count the frequency of words in documents to create sparse document-term matrices for classification.

While One-Hot Encoding represents individual words, Bag of Words (BoW) goes a step further and represents entire sentences or documents by the frequency (count) of the words that appear in them. It is called a "bag" because all information about grammar and word order is discarded.

Intuition: Imagine taking a paragraph, cutting out every individual word with scissors, and tossing the pieces into a bag. You can pull them out and count how many times "bank" appears, but you can no longer tell what order the words originally appeared in.
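
The scissors-and-bag picture can be sketched with Python's collections.Counter. This is only an illustration: the example paragraph is invented here, and splitting on whitespace is a deliberately naive tokenizer.

```python
from collections import Counter

# Hypothetical example paragraph (not from the tutorial's corpus).
paragraph = "I sat on the river bank and then walked to the bank to deposit a check"

# "Cut out" every word: lowercase the text and split on whitespace.
words = paragraph.lower().split()

# "Toss them into a bag": the Counter keeps counts but no positions.
bag = Counter(words)
print(bag["bank"])  # 2

# Putting the words in a different order yields exactly the same bag.
reordered = sorted(words)
print(Counter(reordered) == bag)  # True
```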

Creating a Document-Term Matrix

Bag of Words generates a 2D matrix in which each row represents a document and each column represents a vocabulary word. Each cell holds the number of times that word occurs in that document.

Example Process

Suppose we have two small documents:

Doc 1: "John likes to watch movies. Mary likes movies too."
Doc 2: "John also likes to watch football games."

Step 1: Extract Vocabulary (Unique words)
["john", "likes", "to", "watch", "movies", "mary", "too", "also", "football", "games"]

Step 2: Count Frequencies for each Document

Document	john	likes	to	watch	movies	mary	too	also	football	games
Doc 1	1	2	1	1	2	1	1	0	0	0
Doc 2	1	1	1	1	0	0	0	1	1	1
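
The two steps above can be reproduced by hand in a few lines of plain Python. This is a minimal sketch, not how any particular library does it: the tokenize helper and its letters-only regular expression are assumptions made for illustration.

```python
import re
from collections import Counter

docs = {
    "Doc 1": "John likes to watch movies. Mary likes movies too.",
    "Doc 2": "John also likes to watch football games.",
}

def tokenize(text):
    """Lowercase the text and keep runs of letters, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

# Step 1: build the vocabulary in order of first appearance.
vocab = []
for text in docs.values():
    for token in tokenize(text):
        if token not in vocab:
            vocab.append(token)

# Step 2: one row of counts per document, one column per vocabulary word.
matrix = {}
for name, text in docs.items():
    counts = Counter(tokenize(text))  # missing words count as 0
    matrix[name] = [counts[word] for word in vocab]

print(vocab)
for name, row in matrix.items():
    print(name, row)
# Doc 1 [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
# Doc 2 [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```

Both rows have ten entries even though the documents differ in length, which is what makes the vectors usable as fixed-size model inputs.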

Advantages

Simple and effective baseline model.
Variable length sentences are converted into fixed-length vectors.
Useful for simple Topic Modeling and Spam Classification.

Limitations

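Loss of Semantic Meaning: Because word order is discarded, "You are bad, not good" has exactly the same BoW vector as "You are good, not bad" despite the opposite meaning!
High Dimensionality: On a large corpus the vocabulary can balloon to millions of columns, most of which are zero for any given document, making models extremely memory-heavy unless the matrix is stored in a sparse format.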
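
The first limitation is easy to check directly. In the sketch below, bow_vector is a hypothetical helper written for this illustration, and the five-word vocabulary is fixed by hand; punctuation is stripped before counting, otherwise "bad," and "bad" would be treated as different tokens.

```python
import re
from collections import Counter

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocab]

# Hand-picked vocabulary covering both sentences.
vocab = ["are", "bad", "good", "not", "you"]

vec_a = bow_vector("You are bad, not good", vocab)
vec_b = bow_vector("You are good, not bad", vocab)

print(vec_a)           # [1, 1, 1, 1, 1]
print(vec_b)           # [1, 1, 1, 1, 1]
print(vec_a == vec_b)  # True: the model cannot tell the sentences apart
```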

Python Implementation using Scikit-Learn (CountVectorizer)

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games."
]

# 1. Initialize the CountVectorizer
vectorizer = CountVectorizer()

# 2. Fit to the corpus and transform it into a sparse matrix
X = vectorizer.fit_transform(docs)

# 3. Get the vocabulary (column labels)
vocab = vectorizer.get_feature_names_out()

# 4. Convert the sparse matrix to a dense array and wrap it in a readable Pandas DataFrame
df = pd.DataFrame(X.toarray(), columns=vocab, index=['Doc 1', 'Doc 2'])
print(df)

'''
Output:
       also  football  games  john  likes  mary  movies  to  too  watch
Doc 1     0         0      0     1      2     1       2   1    1      1
Doc 2     1         1      1     1      1     0       0   1    0      1
'''
