Text Representation
One-hot encoding, n-grams, TF-IDF, and co-occurrence matrix methods.
One-Hot Encoding
One-Hot Encoding in NLP
One-Hot Encoding is one of the simplest ways to convert categorical data, such as words, into numerical vectors that machine learning algorithms can process. It represents every word in a vocabulary as a binary vector that is all 0s except for a single 1 at the position assigned to that word.
How It Works
- Build a Vocabulary: Extract all unique words from your dataset.
- Assign an Index: Give each word an integer ID (e.g., Apple=0, Banana=1, Cherry=2).
- Create Vectors: Create an array of zeros equal to the vocabulary size. Set the value at the word's index to 1.
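The three steps above can be sketched from scratch with NumPy. This is a minimal illustration, not a production encoder; the helper names `build_vocab` and `one_hot` are made up for this example, and indices are assigned in order of first appearance.

```python
import numpy as np

def build_vocab(tokens):
    """Steps 1 and 2: map each unique token to an integer index."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # next free index
    return vocab

def one_hot(word, vocab):
    """Step 3: a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab[word]] = 1
    return vec

tokens = ["apple", "banana", "cherry", "banana"]
vocab = build_vocab(tokens)
print(vocab)                     # {'apple': 0, 'banana': 1, 'cherry': 2}
print(one_hot("banana", vocab))  # [0 1 0]
```

A word that is not in the vocabulary raises a `KeyError` here; real pipelines usually reserve an extra "unknown" index for that case.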
Example Vectorization
Suppose our entire vocabulary consists of exactly 4 words: ["cat", "dog", "barks", "meows"]
| Word | cat (Idx 0) | dog (Idx 1) | barks (Idx 2) | meows (Idx 3) | Vector Representation |
|---|---|---|---|---|---|
| cat | 1 | 0 | 0 | 0 | [1, 0, 0, 0] |
| dog | 0 | 1 | 0 | 0 | [0, 1, 0, 0] |
| barks | 0 | 0 | 1 | 0 | [0, 0, 1, 0] |
| meows | 0 | 0 | 0 | 1 | [0, 0, 0, 1] |
Pros
- Extremely easy to understand and implement.
- Requires no parameter tuning or statistical training.
- Works well for very small vocabularies.
Cons (Why it fails)
- Sparsity: If your vocabulary has 50,000 words, every vector holds 49,999 zeros and a single 1. Stored densely, this wastes a large amount of memory.
- No Semantics: "Cat" and "Dog" are orthogonal (perpendicular) vectors, so their dot product is 0. The model has no way to know they are both pets.
- No Context: A one-hot vector encodes only which word it is. It says nothing about neighboring words, and word order is lost as soon as the vectors are summed into a bag of words.
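The first two drawbacks are easy to check numerically. The sketch below uses the four-word vocabulary from the example table and assumes dense `float64` storage (8 bytes per entry) for the memory estimate.

```python
import numpy as np

vocab = ["cat", "dog", "barks", "meows"]
eye = np.eye(len(vocab))   # row i is the one-hot vector for vocab[i]
cat, dog = eye[0], eye[1]

print(np.dot(cat, dog))    # 0.0: two different words are always orthogonal
print(np.dot(cat, cat))    # 1.0: a word is only "similar" to itself

# Dense storage cost of ONE vector for a 50,000-word vocabulary
vocab_size = 50_000
bytes_per_vector = vocab_size * 8   # 8 bytes per float64 entry
print(bytes_per_vector)    # 400000 bytes to store a single 1
```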
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# 1. Define the vocabulary (one word per row; OneHotEncoder expects a 2D array)
words = np.array([['cat'], ['dog'], ['barks'], ['meows']])

# 2. Initialize the encoder (sparse_output=False returns a dense array for readability;
#    the parameter was named sparse before scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)

# 3. Fit and transform the words
one_hot_vectors = encoder.fit_transform(words)

# Display the categories and their vectors
print("Vocabulary Categories:", encoder.categories_[0])
print("\nVectors:")
for i, word in enumerate(words):
    print(f"{word[0]:<6} -> {one_hot_vectors[i]}")

'''
Output:
Vocabulary Categories: ['barks' 'cat' 'dog' 'meows']

Vectors:
cat    -> [0. 1. 0. 0.]
dog    -> [0. 0. 1. 0.]
barks  -> [1. 0. 0. 0.]
meows  -> [0. 0. 0. 1.]

Note: OneHotEncoder sorts categories alphabetically, so 'cat' gets index 1 here
rather than index 0 as in the table above.
'''
N-grams
Bag of Words (BoW) discards sentence structure because it counts every word as an independent token (unigram) and ignores order. N-grams recover part of that structure by taking contiguous sequences of n words from a given text, which lets the model capture local context and short grammatical patterns.
Types of N-grams
Consider the sentence: "The weather is very good"
- Unigrams (n=1): ["The", "weather", "is", "very", "good"] (This is standard BoW!)
- Bigrams (n=2): ["The weather", "weather is", "is very", "very good"] (Captures 2-word contexts)
- Trigrams (n=3): ["The weather is", "weather is very", "is very good"]
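The lists above can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes whitespace tokenization; the function name `ngrams` is chosen for this example only.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The weather is very good".split()
print(ngrams(tokens, 1))  # ['The', 'weather', 'is', 'very', 'good']
print(ngrams(tokens, 2))  # ['The weather', 'weather is', 'is very', 'very good']
print(ngrams(tokens, 3))  # ['The weather is', 'weather is very', 'is very good']
```

A sentence of k tokens yields k - n + 1 n-grams, so the feature count grows quickly as n increases over a large corpus.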
Why are N-grams crucial for NLP?
1. Resolving Negations
Standard BoW misses negation. If an angry review says "not good", a unigram BoW counts "not" and "good" separately, and the presence of "good" can push the classifier toward positive sentiment. A bigram model tracks "not good" as a single feature that signals negativity.
2. Named Entities
Names of people, places, and monuments are often multi-word. "New York" refers to one specific city, whereas "New" on its own is an adjective and "York" on its own is a different city in England. Bigrams capture "New York" as a single feature.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The food is not good but the service is very good"]

# ngram_range=(min_n, max_n).
# For example, (2, 2) means ONLY bigrams; (1, 2) means unigrams AND bigrams.
# Here (2, 3) extracts bigrams and trigrams.
vectorizer = CountVectorizer(ngram_range=(2, 3))
X = vectorizer.fit_transform(docs)

# Print all extracted n-gram features (CountVectorizer lowercases text by default)
features = vectorizer.get_feature_names_out()
print("Extracted N-grams:")
for f in features:
    print(f"- '{f}'")

'''
Output (abridged):
Extracted N-grams:
- 'but the'
- 'but the service'
- 'food is'
- 'food is not'
...
- 'is not good'   <-- Trigram successfully captured the true sentiment
...
- 'not good'      <-- Bigram captured negation
...
- 'very good'
'''
TF-IDF Vectorization
TF-IDF (Term Frequency - Inverse Document Frequency)
Bag of Words scores words by raw frequency, but frequency is not always a good indicator of importance. Across 1,000 news articles, the word "the" may appear 10,000 times or more, so a plain BoW model gives it the largest weight even though it says nothing about any article's topic.
TF-IDF addresses this with a simple premise: a word is important to a document if it appears frequently in that document but rarely across the entire corpus (all documents).
Term Frequency (TF)
Measures how frequently a term (t) occurs in a document (d).
$$\mathrm{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total words in } d}$$
Rewards a term like "election" if it appears 10 times in a political article.
Inverse Doc Freq (IDF)
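Measures how rare, and therefore how informative, a term is across the entire corpus (N documents).
$$\mathrm{IDF}(t) = \log\left(\frac{N}{\text{number of documents containing } t}\right)$$
Penalizes common words like "the". If "the" appears in 100% of the documents, the ratio is 1 and log(1) = 0, so its IDF score becomes zero. Library implementations often smooth this formula: scikit-learn's default, for example, computes ln((1 + N) / (1 + df)) + 1, where df is the number of documents containing the term, so very common words receive the minimum weight rather than exactly zero.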
$$\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$
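To make the formulas concrete, here is a small from-scratch sketch using the unsmoothed definitions given above. The three toy documents and the helper names `tf`, `idf`, and `tf_idf` are made up for this illustration, and `idf` assumes the term occurs in at least one document.

```python
import math

# Toy corpus, already tokenized; "the" occurs in every document
docs = [
    "the election results shook the city".split(),
    "the team won the match".split(),
    "the recipe needs fresh basil".split(),
]

def tf(term, doc):
    """Count of the term in the document divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / number of documents containing the term)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[0], docs))                 # 0.0
print(round(tf_idf("election", docs[0], docs), 3))  # 0.183
```

"the" has the highest term frequency in the first document (2 of 6 words), yet its score is zero because it appears in all three documents; "election" appears once but only in one document, so it scores higher.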
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Notice that the word 'document' appears in two of the three documents
# ('this' does too, but it is removed as a stop word).
docs = [
    "This is a document about machine learning.",
    "This document covers natural language processing.",
    "I love baking sweet pastries, a completely unrelated topic."
]

# Initialize TF-IDF with English stop-word removal
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(docs)

# Get terms and build a labeled table (rows = documents, columns = terms)
terms = vectorizer.get_feature_names_out()
df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=terms,
    index=['Doc 1 (ML)', 'Doc 2 (NLP)', 'Doc 3 (Baking)'],
)

# Round floats to two decimals for readability
pd.set_option('display.float_format', lambda x: '%.2f' % x)
print(df[['document', 'learning', 'machine', 'baking']])
'''
Output:
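                document  learning  machine  baking
Doc 1 (ML)          0.47      0.62     0.62    0.00
Doc 2 (NLP)         0.36      0.00     0.00    0.00
Doc 3 (Baking)      0.00      0.00     0.00    0.38

Notice: 'document' has a LOWER score (0.47) than 'machine' (0.62) in Doc 1.
Why? Because 'document' appears in both Doc 1 and Doc 2, so the IDF penalty lowered its uniqueness score!
Each row is also L2-normalized by default, so the same term scores lower in a document
that keeps more terms: 'document' is 0.47 in Doc 1 (three terms) but 0.36 in Doc 2 (five terms).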
'''
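The Doc 1 scores can be reproduced by hand, which shows what `TfidfVectorizer` does beyond the textbook formula. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`) and assumes that stop-word removal leaves Doc 1 with exactly three terms: 'document', 'machine', and 'learning'.

```python
import math

n_docs = 3
# Number of documents containing each term (after stop-word removal)
doc_freq = {"document": 2, "machine": 1, "learning": 1}

# Smoothed IDF used by scikit-learn by default: ln((1 + N) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}

# Each term occurs once in Doc 1, so the raw weight is 1 * idf
raw = {t: 1 * idf[t] for t in ("document", "machine", "learning")}

# L2-normalize the row so its weights have unit length
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {t: round(w / norm, 2) for t, w in raw.items()}
print(weights)  # {'document': 0.47, 'machine': 0.62, 'learning': 0.62}
```

Note that scikit-learn uses the raw count as the term frequency rather than dividing by document length; the L2 normalization step is what accounts for document length.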
Co-occurrence Matrix
Bag-of-Words and TF-IDF create Document-Term matrices (Documents as rows, Words as columns). In contrast, a Co-occurrence Matrix creates a Word-Word matrix (Words as rows, Words as columns). It captures how often two different words appear together within a specific "window" distance in a sentence.
This follows the distributional hypothesis, summarized by the linguist J.R. Firth as: "You shall know a word by the company it keeps." Words that appear in similar contexts tend to have similar meanings.
How it works: The Context Window
Assume a corpus with one sentence: "deep learning is incredibly exciting"
If we set our Window Size = 1 (look 1 word left, 1 word right), we scan the text:
- Focus on "learning": Left is "deep", Right is "is".
- Add +1 to coordinates (learning, deep) and (learning, is) in the matrix.
| | deep | learning | is | incredibly | exciting |
|---|---|---|---|---|---|
| deep | 0 | 1 | 0 | 0 | 0 |
| learning | 1 | 0 | 1 | 0 | 0 |
| is | 0 | 1 | 0 | 1 | 0 |
| incredibly | 0 | 0 | 1 | 0 | 1 |
| exciting | 0 | 0 | 0 | 1 | 0 |
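The scan described above can be sketched in a few lines of Python. This is an illustrative implementation, assuming whitespace tokenization and a symmetric window; the function name `cooccurrence_matrix` is made up for this example.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=1):
    """Count how often each pair of words appears within `window` positions."""
    vocab = list(dict.fromkeys(tokens))          # unique words, first-seen order
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)), dtype=int)
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)                # clip the window at the edges
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:                   # skip the focus word itself
                matrix[index[word], index[tokens[ctx_pos]]] += 1
    return vocab, matrix

tokens = "deep learning is incredibly exciting".split()
vocab, matrix = cooccurrence_matrix(tokens, window=1)
print(vocab)
print(matrix)
```

With `window=1` the result matches the table above, and the matrix is symmetric because every (focus, context) pair is also visited in the opposite direction.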
Advantages
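- Captures semantic relationships between words, which BoW does not.
- Vectors from this matrix have geometric meaning: words used in similar contexts, such as synonyms, tend to lie close together in the vector space.
- Forms the basis of GloVe embeddings, which are trained on word-word co-occurrence counts. Latent Semantic Analysis (LSA) applies the same count-then-factorize idea, though classically to a term-document matrix.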
Disadvantages
- Memory Intensive: Matrix size is Vocab x Vocab. If V = 100,000, a dense array needs 10 billion elements, so in practice the counts are stored in a sparse matrix.
- Usually needs dimensionality reduction, such as Singular Value Decomposition (SVD), before the vectors are practical to use in a model.
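As a rough sketch of the SVD step, the example below factorizes the 5 x 5 matrix from the "deep learning is incredibly exciting" example and keeps the top k = 2 dimensions as word vectors. The choice of k and of scaling the left singular vectors by the singular values are assumptions made for illustration; other conventions exist.

```python
import numpy as np

# Co-occurrence counts for "deep learning is incredibly exciting", window = 1
M = np.array([
    [0, 1, 0, 0, 0],   # deep
    [1, 0, 1, 0, 0],   # learning
    [0, 1, 0, 1, 0],   # is
    [0, 0, 1, 0, 1],   # incredibly
    [0, 0, 0, 1, 0],   # exciting
], dtype=float)

# Full SVD: M = U @ diag(S) @ Vt, singular values sorted in descending order
U, S, Vt = np.linalg.svd(M)

# Keep only the k strongest dimensions as dense word vectors
k = 2
embeddings = U[:, :k] * S[:k]
print(embeddings.shape)  # (5, 2)
```

A dense full SVD like this does not scale to a realistic vocabulary; for large sparse matrices a truncated solver such as `scipy.sparse.linalg.svds` or scikit-learn's `TruncatedSVD` is the usual choice.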