Bag of Words (BoW)
Count the frequency of words in documents to create sparse document-term matrices for classification.
While One-Hot Encoding represents each unique word individually, Bag of Words (BoW) goes a step further and represents an entire sentence or document by the frequency (count) of the words that appear in it. It is called a "Bag" because all information about the grammatical order or sequence of the words is discarded.
Intuition: Imagine taking a paragraph, cutting out every individual word with scissors, and tossing them into a bag. You can pull them out and count how many times "bank" appears, but you have no idea what order the words originally formed.
Creating a Document-Term Matrix
Bag of Words produces a 2D matrix in which each row represents a document and each column represents a vocabulary word. Each cell holds the number of times that word occurs in that document.
Example Process
Suppose we have two small documents:
- Doc 1: "John likes to watch movies. Mary likes movies too."
- Doc 2: "John also likes to watch football games."
Step 1: Build the Vocabulary (unique words, lowercased, punctuation removed)
["john", "likes", "to", "watch", "movies", "mary", "too", "also", "football", "games"]
Step 2: Count Frequencies for each Document
| Document | john | likes | to | watch | movies | mary | too | also | football | games |
|---|---|---|---|---|---|---|---|---|---|---|
| Doc 1 | 1 | 2 | 1 | 1 | 2 | 1 | 1 | 0 | 0 | 0 |
| Doc 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
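The two steps above can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: the `tokenize` helper and its letters-only regular expression are assumptions made for this example, and the vocabulary is kept in order of first appearance so that it matches the table.

```python
import re
from collections import Counter

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep runs of letters,
    # which drops punctuation such as the full stops.
    return re.findall(r"[a-z]+", text.lower())

# Step 1: build the vocabulary in order of first appearance.
vocab = []
for doc in docs:
    for token in tokenize(doc):
        if token not in vocab:
            vocab.append(token)

# Step 2: count each vocabulary word in each document.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each printed row should correspond to one row of the table, with the columns in the same order as the vocabulary list.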
Advantages
- Simple and effective baseline model.
- Variable-length sentences are converted into fixed-length vectors.
- Useful for simple Topic Modeling and Spam Classification.
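The fixed-length property can be seen in a small sketch. The four-word vocabulary and the `bow_vector` helper below are hypothetical, chosen only for illustration: once the vocabulary is fixed, a short sentence and a long one both map to vectors with one entry per vocabulary word, and words outside the vocabulary are simply ignored.

```python
import re
from collections import Counter

# Hypothetical fixed vocabulary, as if learned from a training corpus.
vocab = ["john", "likes", "movies", "football"]

def bow_vector(text, vocab):
    # Count lowercase word tokens, then read off the counts in vocabulary order.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocab]

short_vec = bow_vector("John likes football.", vocab)
long_vec = bow_vector(
    "John likes movies, and John really likes football movies too.", vocab
)

print(short_vec, len(short_vec))
print(long_vec, len(long_vec))
```

Because every document ends up with the same number of features, the vectors can be stacked into a matrix and passed to a standard classifier.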
Limitations
- Loss of Word Order and Context: "You are bad, not good" has exactly the same BoW vector as "You are good, not bad", despite the opposite meanings!
- The vocabulary can balloon to millions of columns on large corpora. Most entries in each row are zero, so storing the matrix densely makes models extremely memory-heavy.
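The word-order limitation is easy to check directly. The sketch below uses a hypothetical `bow_counts` helper built on `collections.Counter` rather than any particular library's vectorizer, and compares the word counts of the two sentences.

```python
import re
from collections import Counter

def bow_counts(text):
    # Hypothetical helper: a bag of lowercase word tokens with their counts.
    return Counter(re.findall(r"[a-z]+", text.lower()))

a = bow_counts("You are bad, not good")
b = bow_counts("You are good, not bad")

# Both sentences contain the same five words once each, so the bags are equal.
same_bag = a == b
print(same_bag)
```

Any model that sees only these counts has no way to tell the two sentences apart.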
Python Implementation using Scikit-Learn (CountVectorizer)
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
docs = [
"John likes to watch movies. Mary likes movies too.",
"John also likes to watch football games."
]
# 1. Initialize the CountVectorizer
vectorizer = CountVectorizer()
# 2. Fit to the corpus and transform it into a sparse matrix
X = vectorizer.fit_transform(docs)
# 3. Get the vocabulary (column labels)
vocab = vectorizer.get_feature_names_out()
# Create a readable Pandas DataFrame
df = pd.DataFrame(X.toarray(), columns=vocab, index=['Doc 1', 'Doc 2'])
print(df)
'''
Output:
       also  football  games  john  likes  mary  movies  to  too  watch
Doc 1     0         0      0     1      2     1       2   1    1      1
Doc 2     1         1      1     1      1     0       0   1    0      1
'''