Text Representation | Nikhil Learn Hub

One-Hot Encoding

One-Hot Encoding in NLP

One-Hot Encoding is the most fundamental way to convert categorical data, such as strings of words, into numerical vectors that machine learning algorithms can understand. It represents every word in a vocabulary as a binary vector with all 0s and a single 1.

How It Works

Build a Vocabulary: Extract all unique words from your dataset.
Assign an Index: Give each word an integer ID (e.g., Apple=0, Banana=1, Cherry=2).
Create Vectors: For each word, create an array of zeros whose length equals the vocabulary size, then set the value at the word's index to 1.
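
The three steps above can be sketched by hand with NumPy. This is a minimal illustration, not a library API: the helper names `build_vocab` and `one_hot` are made up for this example, and indices are assigned in order of first appearance.

```python
import numpy as np

def build_vocab(tokens):
    """Steps 1 and 2: map each unique token to an integer index (first-appearance order)."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """Step 3: a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab[word]] = 1
    return vec

tokens = ["apple", "banana", "cherry", "banana"]
vocab = build_vocab(tokens)
banana_vec = one_hot("banana", vocab)

print(vocab)       # {'apple': 0, 'banana': 1, 'cherry': 2}
print(banana_vec)  # [0 1 0]
```

A word that is not in the vocabulary raises a `KeyError` here; real pipelines usually reserve an extra "unknown" index for that case.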

                

Example Vectorization

Suppose our entire vocabulary consists of exactly 4 words: ["cat", "dog", "barks", "meows"]

Word	cat (Idx 0)	dog (Idx 1)	barks (Idx 2)	meows (Idx 3)	Vector Representation
cat	1	0	0	0	`[1, 0, 0, 0]`
dog	0	1	0	0	`[0, 1, 0, 0]`
barks	0	0	1	0	`[0, 0, 1, 0]`
meows	0	0	0	1	`[0, 0, 0, 1]`

Pros

Extremely easy to understand and implement.
Requires no parameter tuning or statistical training.
Works well for very small vocabularies.

Cons (Why it fails)

Sparsity: With a 50,000-word vocabulary, every vector holds 49,999 zeros and a single 1. Stored densely, this wastes a great deal of memory.
No Semantics: The vectors for "cat" and "dog" are orthogonal (perpendicular): their dot product is 0, as it is for any two distinct words. The model cannot tell that both are pets.
No Context: A word always gets the same vector, whatever words surround it, and summing the vectors of a sentence discards word order.
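
The first two drawbacks can be checked numerically. The sketch below reuses the four-word vocabulary from the example table; the 8-bytes-per-value figure assumes a dense float64 array, which is an assumption of this example rather than something the encoding requires.

```python
import numpy as np

vocab = ["cat", "dog", "barks", "meows"]
vectors = np.eye(len(vocab), dtype=int)  # row i is the one-hot vector of vocab[i]

cat, dog = vectors[0], vectors[1]
cat_dot_dog = int(cat @ dog)  # distinct words never share a 1
cat_dot_cat = int(cat @ cat)  # a word only matches itself

# Sparsity: cost of storing every one-hot vector of a 50,000-word vocabulary densely
vocab_size = 50_000
zeros_per_vector = vocab_size - 1
dense_bytes = vocab_size * vocab_size * 8  # float64 = 8 bytes per value

print(cat_dot_dog, cat_dot_cat)          # 0 1
print(zeros_per_vector)                  # 49999
print(f"{dense_bytes / 1e9:.0f} GB")     # 20 GB
```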

Python Implementation using Scikit-Learn

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# 1. Define the vocabulary as a 2-D array: one word per row, a single feature column
words = np.array([['cat'], ['dog'], ['barks'], ['meows']])

# 2. Initialize the encoder (sparse_output=False returns a dense array for readability;
#    scikit-learn versions before 1.2 name this parameter sparse)
encoder = OneHotEncoder(sparse_output=False)

# 3. Fit and transform the words
one_hot_vectors = encoder.fit_transform(words)

# Display the categories and their vectors
print("Vocabulary Categories:", encoder.categories_[0])
print("\nVectors:")
for i, word in enumerate(words):
    print(f"{word[0]:<6} -> {one_hot_vectors[i]}")

'''
Output (the encoder sorts categories alphabetically, so the index order differs from the table above):
Vocabulary Categories: ['barks' 'cat' 'dog' 'meows']

Vectors:
cat    -> [0. 1. 0. 0.]
dog    -> [0. 0. 1. 0.]
barks  -> [1. 0. 0. 0.]
meows  -> [0. 0. 0. 1.]
'''

N-grams

Bag of Words (BoW), which represents a document by its word counts, discards sentence structure because it treats every word as an independent token (unigram). N-grams address this by taking contiguous sequences of n words from a given text. This allows the model to capture local context and short grammatical structures.

Types of N-grams

Consider the sentence: "The weather is very good"

Unigrams (n=1): ["The", "weather", "is", "very", "good"] (This is standard BoW!)
Bigrams (n=2): ["The weather", "weather is", "is very", "very good"] (Captures 2-word contexts)
Trigrams (n=3): ["The weather is", "weather is very", "is very good"]
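
The lists above can be reproduced with a sliding window over the tokens. This is a minimal sketch: the function name `ngrams` is chosen for this example, and splitting on whitespace is a simplifying assumption (real tokenizers also handle punctuation and case).

```python
def ngrams(text, n):
    """Return all contiguous n-word sequences in the text, in order."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The weather is very good"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)   # ['The weather', 'weather is', 'is very', 'very good']
print(trigrams)  # ['The weather is', 'weather is very', 'is very good']
```

A sentence of T tokens yields T - n + 1 n-grams, which is why the five-word sentence gives four bigrams and three trigrams.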

Why are N-grams crucial for NLP?

1. Resolving Negations

Standard BoW misses negation. If an angry review says "not good", a Unigram BoW treats "not" and "good" separately, which might confuse the classifier into thinking positive sentiment exists. A Bigram model explicitly tracks "not good" as a single feature variable indicating negativity.

2. Named Entities

Names of people, places, and monuments often span several words. "New York" is a city, whereas "New" and "York" taken separately are an adjective and an English city. A bigram captures "New York" as a single feature.

The Curse of Dimensionality: Why not use 10-grams everywhere? Because the number of possible combinations explodes. A vocabulary of 10,000 unique words allows up to 10,000 squared (100 million) distinct bigrams. Too many features make machine learning models prone to overfitting and can exhaust available RAM. In practice, bigrams (n=2) and trigrams (n=3) are usually the largest n-grams used.
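
The growth is easy to tabulate. The figures below are theoretical upper bounds (every word combined with every other word); a real corpus contains far fewer distinct n-grams, but the feature space still grows quickly with n.

```python
vocab_size = 10_000

# Upper bound on the number of distinct n-grams: vocab_size ** n
upper_bounds = {n: vocab_size ** n for n in (1, 2, 3)}

for n, bound in upper_bounds.items():
    print(f"n={n}: up to {bound:,} possible n-grams")
# n=1: up to 10,000 possible n-grams
# n=2: up to 100,000,000 possible n-grams
# n=3: up to 1,000,000,000,000 possible n-grams
```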

Implementing Bigrams & Trigrams with Scikit-Learn

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The food is not good but the service is very good"]

# ngram_range=(min_n, max_n).
# Example: (2, 2) means ONLY bigrams. (1, 2) means unigrams AND bigrams.
# Here (2, 3) extracts bigrams AND trigrams.
vectorizer = CountVectorizer(ngram_range=(2, 3))

X = vectorizer.fit_transform(docs)

# Print all extracted N-gram features
features = vectorizer.get_feature_names_out()
print("Extracted N-grams:")
for f in features:
    print(f"- '{f}'")

'''
Output:
Extracted N-grams:
- 'but the'
- 'but the service'
- 'food is'
- 'food is not'
...
- 'is not good'   <-- Trigram keeps the negation attached to "good"
...
- 'not good'      <-- Bigram captured negation
...
- 'very good'
'''

TF-IDF Vectorization

TF-IDF (Term Frequency - Inverse Document Frequency)

Bag of Words scores words by raw frequency, but frequency is not always a good indicator of importance. Across 1,000 news articles, the word "the" might appear 10,000 times, and a plain BoW model would wrongly treat it as the most important word in the corpus.

TF-IDF solves this. It relies on a simple premise: a word is highly important if it appears frequently in a specific document, but rarely across the entire corpus (all documents).

Term Frequency (TF)

Measures how frequently a term (t) occurs in a document (d).

$$\mathrm{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total words in } d}$$

Rewards a term like "election" when it appears 10 times in a political article.

Inverse Doc Freq (IDF)

Measures how rare, and therefore how informative, a term is across the entire corpus (N documents).

$$\mathrm{IDF}(t) = \log\left(\frac{N}{\text{number of docs containing } t}\right)$$

Penalizes common words like "the". If "the" appears in 100% of the documents, the ratio is N/N = 1 and log(1) = 0, so the IDF score of "the" becomes zero.

TF-IDF Score = TF * IDF
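
The formulas above translate almost line for line into plain Python. This is a sketch of the textbook definition only: the three tiny documents are invented for illustration, the natural logarithm is assumed, and `idf` fails with a division by zero for a term that appears in no document.

```python
import math

# Hypothetical mini-corpus, already tokenized
docs = [
    "the election results shaped the election debate".split(),
    "the weather is good".split(),
    "the market fell".split(),
]

def tf(term, doc):
    """Count of the term in the document divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (total docs / docs containing the term)."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

the_score = tf_idf("the", docs[0], docs)
election_score = tf_idf("election", docs[0], docs)

print(the_score)                 # 0.0   ("the" is in every document, so IDF = log(1) = 0)
print(round(election_score, 3))  # 0.314 (TF = 2/7, IDF = log(3))
```

Both words occur twice in the first document, so their TF is identical; the difference in score comes entirely from IDF.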

Python Implementation using Scikit-Learn

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Notice that 'document' appears in two documents ('this' does too, but it is removed as a stop word)
docs = [
    "This is a document about machine learning.",
    "This document covers natural language processing.",
    "I love baking sweet pastries, a completely unrelated topic."
]

# Initialize TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(docs)

# Get terms
terms = vectorizer.get_feature_names_out()

df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=terms,
    index=['Doc 1 (ML)', 'Doc 2 (NLP)', 'Doc 3 (Baking)'],
)

# Format for rounded float readability
pd.set_option('display.float_format', lambda x: '%.2f' % x)
print(df[['document', 'learning', 'machine', 'baking']])

'''
Output:
                document  learning  machine  baking
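Doc 1 (ML)          0.47      0.62     0.62    0.00
Doc 2 (NLP)         0.36      0.00     0.00    0.00
Doc 3 (Baking)      0.00      0.00     0.00    0.38

Notice: 'document' has a LOWER score (0.47) than 'machine' (0.62) in Doc 1.
Why? Because 'document' appears in both Doc 1 and Doc 2, so its IDF weight is lower.
Scores also depend on document length: scikit-learn L2-normalizes each row, so 'document'
scores lower in Doc 2 (five remaining terms) than in Doc 1 (three remaining terms).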
'''

Co-occurrence Matrix

Bag-of-Words and TF-IDF create Document-Term matrices (Documents as rows, Words as columns). In contrast, a Co-occurrence Matrix creates a Word-Word matrix (Words as rows, Words as columns). It captures how often two different words appear together within a specific "window" distance in a sentence.

This follows the distributional hypothesis, popularized by the linguist J.R. Firth: "You shall know a word by the company it keeps." Words that appear in similar contexts usually share semantic meaning.

How it works: The Context Window

Assume a corpus with one sentence: "deep learning is incredibly exciting"

If we set our Window Size = 1 (look 1 word left, 1 word right), we scan the text:

Focus on "learning": Left is "deep", Right is "is".
Add +1 to coordinates (learning, deep) and (learning, is) in the matrix.

	deep	learning	is	incredibly	exciting
deep	0	1	0	0	0
learning	1	0	1	0	0
is	0	1	0	1	0
incredibly	0	0	1	0	1
exciting	0	0	0	1	0
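
The table above can be reproduced with a short sliding-window loop. This is a minimal sketch: the function name `cooccurrence_matrix` is made up for this example, tokens are split on whitespace, and the window is symmetric (the same number of words to the left and right).

```python
import numpy as np

def cooccurrence_matrix(tokens, window=1):
    """Count how often each pair of words appears within `window` positions of each other."""
    vocab = list(dict.fromkeys(tokens))  # unique tokens in first-appearance order
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)), dtype=int)

    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:  # skip the focus word itself
                matrix[index[word], index[tokens[ctx_pos]]] += 1
    return vocab, matrix

tokens = "deep learning is incredibly exciting".split()
vocab, matrix = cooccurrence_matrix(tokens, window=1)

print(vocab)
print(matrix)
# [[0 1 0 0 0]
#  [1 0 1 0 0]
#  [0 1 0 1 0]
#  [0 0 1 0 1]
#  [0 0 0 1 0]]
```

With a symmetric window the matrix is always symmetric, so in practice only half of it needs to be stored.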

Advantages

Preserves semantic relationships between words (unlike BoW).
Vectors from this matrix carry geometric meaning: words used in similar contexts, such as synonyms, cluster together in vector space.
Count matrices of this kind are the mathematical basis for GloVe embeddings (word-word co-occurrence) and, in term-document form, Latent Semantic Analysis (LSA).

                        

Disadvantages

Memory Intensive: The matrix is V x V for a vocabulary of size V. With V=100,000, a dense array has 10 billion elements, so sparse matrices are usually required.
High Dimensionality: Raw rows are long and mostly zero, so dimensionality reduction such as Singular Value Decomposition (SVD) is usually applied before the vectors are used in a model.
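
The SVD step can be sketched with NumPy on the 5 x 5 matrix from the example above. Keeping the top k singular directions gives each word a k-dimensional vector; the choice k = 2 is arbitrary here, and on a matrix this small the resulting vectors are only illustrative.

```python
import numpy as np

# Co-occurrence matrix for "deep learning is incredibly exciting" (window size 1)
M = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# Full decomposition: M = U @ diag(S) @ Vt, singular values sorted in descending order
U, S, Vt = np.linalg.svd(M)

# Truncate to the k largest singular values to get dense, low-dimensional word vectors
k = 2
word_vectors = U[:, :k] * S[:k]

print(word_vectors.shape)  # (5, 2): one 2-dimensional vector per word
```

For a realistic vocabulary, a sparse truncated solver such as `scipy.sparse.linalg.svds` or scikit-learn's `TruncatedSVD` is the usual choice, since a full dense SVD of a V x V matrix is too expensive.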

Note on Context Window: A small window (1-2 words) identifies words that are grammatically interchangeable (e.g., "dog" and "cat" are both followed by "barks" or "meows"). A large window (5-10 words) identifies topically related words (e.g., "doctor" and "hospital" tend to appear near each other but not necessarily adjacent).