TF-IDF Vectorization
Learn Term Frequency-Inverse Document Frequency to weigh the true importance of words by penalizing common terms.
TF-IDF (Term Frequency - Inverse Document Frequency)
Bag of Words evaluates the frequency of words. However, raw frequency isn't always a good indicator of importance. If you analyze 1,000 news articles, the word "the" might appear 10,000 times, so a standard BoW model will assign it the highest weight, wrongly suggesting "the" is the most important term in the articles!
TF-IDF solves this. It relies on a simple premise: a word is highly important if it appears frequently in a specific document, but rarely across the entire corpus (all documents).
Term Frequency (TF)
Measures how frequently a term (t) occurs in a document (d):

TF(t, d) = (count of t in d) / (total words in d)
Rewards terms like "election" if it appears 10 times in a political article.
Inverse Document Frequency (IDF)
Measures how important a term is across the entire corpus (N documents).
IDF(t) = log( total docs (N) / docs containing t )
Penalizes common words like "the". Since "the" appears in all N documents, IDF("the") = log(N / N) = log(1) = 0, so its TF-IDF score is zero no matter how often it occurs!
TF-IDF(t, d) = TF(t, d) * IDF(t)

(Note: scikit-learn's TfidfVectorizer additionally smooths the IDF term and L2-normalizes each document vector by default, so its scores differ slightly from this textbook formula.)
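The two formulas above can be hand-rolled in a few lines. This is a sketch using the textbook definitions (no smoothing or normalization, unlike scikit-learn); the corpus sentences are invented for illustration:

```python
import math

def tf(term, doc):
    # Term Frequency: occurrences of the term / total tokens in the doc
    tokens = doc.lower().split()
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Inverse Document Frequency: log(N / number of docs containing the term)
    n_containing = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / n_containing)

corpus = [
    "the election results are in the news",
    "the match results thrilled the fans",
    "the weather is mild today",
]

# "the" is in every doc: IDF = log(3/3) = 0, so its TF-IDF is 0 regardless of TF
print(tf("the", corpus[0]) * idf("the", corpus))        # 0.0
# "election" is in one doc: TF = 1/7, IDF = log(3) ≈ 1.10, TF-IDF ≈ 0.16
print(tf("election", corpus[0]) * idf("election", corpus))
```

Even though "the" is the most frequent token in the first document, "election" ends up with the higher score because it is rare across the corpus.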
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Notice that 'document' appears in two of the three documents,
# while 'this' is an English stop word and will be dropped entirely!
docs = [
"This is a document about machine learning.",
"This document covers natural language processing.",
"I love baking sweet pastries, a completely unrelated topic."
]
# Initialize the TF-IDF vectorizer; stop_words='english' drops filler words like 'this' and 'is'
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(docs)
# Get terms
terms = vectorizer.get_feature_names_out()
df = pd.DataFrame(tfidf_matrix.toarray(), columns=terms, index=['Doc 1 (ML)', 'Doc 2 (NLP)', 'Doc 3 (Baking)'])
# Format for rounded float readability
pd.set_option('display.float_format', lambda x: '%.2f' % x)
print(df[['document', 'learning', 'machine', 'baking']])
'''
Output:
                document  learning  machine  baking
Doc 1 (ML)          0.47      0.62     0.62    0.00
Doc 2 (NLP)         0.36      0.00     0.00    0.00
Doc 3 (Baking)      0.00      0.00     0.00    0.38
Notice: 'document' has a LOWER score (0.47) than 'machine' (0.62) in Doc 1.
Why? Because 'document' appears in both Doc 1 and Doc 2, so the IDF penalty lowered its uniqueness score!
'''
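A common next step is feeding these TF-IDF vectors into a similarity measure. Here is a sketch (using scikit-learn's cosine_similarity with the same three documents) showing that the two tech documents score as related while the baking document scores as unrelated:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "This is a document about machine learning.",
    "This document covers natural language processing.",
    "I love baking sweet pastries, a completely unrelated topic."
]

tfidf_matrix = TfidfVectorizer(stop_words='english').fit_transform(docs)

# Pairwise cosine similarity between the TF-IDF document vectors
sim = cosine_similarity(tfidf_matrix)

print(round(sim[0, 1], 2))  # Doc 1 vs Doc 2: share 'document', so > 0
print(round(sim[0, 2], 2))  # Doc 1 vs Doc 3: no shared terms, so 0.0
```

Doc 1 and Doc 2 overlap only on the down-weighted word 'document', so their similarity is positive but modest; Doc 3 shares no terms with Doc 1 after stop-word removal, so its similarity is exactly zero.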