Bag of Words (BoW) Tutorial Section

Bag of Words (BoW)

Count how often each word appears in each document to build a sparse document-term matrix for classification.

While One-Hot Encoding represents each unique word individually, Bag of Words (BoW) goes a step further and represents an entire sentence or document by the frequency (count) of the words that appear in it. It is called a "bag" because all information about grammar and word order is discarded.

Intuition: Imagine taking a paragraph, cutting out every individual word with scissors, and tossing the pieces into a bag. You can pull them out and count how many times "bank" appears, but you can no longer tell what order the words originally appeared in.

Creating a Document-Term Matrix

Bag of Words produces a 2D matrix in which each row represents a document and each column represents a vocabulary word. Each cell holds the number of times that word occurs in that document.

Example Process

Suppose we have two small documents:

  • Doc 1: "John likes to watch movies. Mary likes movies too."
  • Doc 2: "John also likes to watch football games."

Step 1: Extract Vocabulary (Unique words)

["john", "likes", "to", "watch", "movies", "mary", "too", "also", "football", "games"]

Step 2: Count Frequencies for each Document

Document  john  likes  to  watch  movies  mary  too  also  football  games
Doc 1        1      2   1      1       2     1    1     0         0      0
Doc 2        1      1   1      1       0     0    0     1         1      1

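The two steps above can be sketched in plain Python without any library. This is a minimal illustration, not how any particular library implements it: the `tokenize` helper and its regular expression are assumptions made for this example, and the vocabulary is kept in first-seen order so that it matches the hand-built table.

```python
import re
from collections import Counter

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

# Step 1: collect the unique words in the order they are first seen.
vocab = []
for doc in docs:
    for token in tokenize(doc):
        if token not in vocab:
            vocab.append(token)

# Step 2: count how often each vocabulary word occurs in each document.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocab])

print(vocab)
for label, row in zip(["Doc 1", "Doc 2"], matrix):
    print(label, row)
```

Each row has one entry per vocabulary word, so both documents end up as vectors of length 10 regardless of how many words they contain.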
Advantages
  • Simple and effective baseline model.
  • Variable length sentences are converted into fixed-length vectors.
  • Useful for simple Topic Modeling and Spam Classification.
Limitations
  • Loss of Semantic Meaning: "You are bad, not good" has the exact same BoW vector as "You are good, not bad" despite opposite meanings!
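  • Vocabulary size can balloon to millions of columns on large corpora. Most entries in each row are zero, so the matrix becomes extremely memory-heavy unless it is stored in a sparse format.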
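The first limitation is easy to check by hand. The sketch below counts the words of the two example sentences with a small `bow` helper (a hypothetical name used only for this illustration, with an assumed letters-only tokenizer) and compares the results.

```python
import re
from collections import Counter

def bow(text):
    # Hypothetical helper: lowercase, split into runs of letters, count tokens.
    return Counter(re.findall(r"[a-z]+", text.lower()))

a = bow("You are bad, not good")
b = bow("You are good, not bad")

# Both sentences contain exactly the same words the same number of times,
# so their bag-of-words representations are indistinguishable.
print(a == b)
print(sorted(a.items()))
```

Because the counts are identical, any classifier that sees only these vectors has no way to tell the two sentences apart.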
Python Implementation using Scikit-Learn (CountVectorizer)
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games."
]

# 1. Initialize the CountVectorizer
vectorizer = CountVectorizer()

# 2. Fit to the corpus and transform it into a sparse matrix
X = vectorizer.fit_transform(docs)

# 3. Get the vocabulary (column labels); CountVectorizer orders them alphabetically
vocab = vectorizer.get_feature_names_out()

# Create a readable Pandas DataFrame
df = pd.DataFrame(X.toarray(), columns=vocab, index=['Doc 1', 'Doc 2'])
print(df)

'''
Output:
       also  football  games  john  likes  mary  movies  to  too  watch
Doc 1     0         0      0     1      2     1       2   1    1      1
Doc 2     1         1      1     1      1     0       0   1    0      1
'''