One-Hot Encoding
The simplest method for representing textual data numerically, using sparse binary vectors.
One-Hot Encoding in NLP
One-Hot Encoding is the most fundamental way to convert categorical data, such as strings of words, into numerical vectors that machine learning algorithms can understand. It represents every word in a vocabulary as a binary vector with all 0s and a single 1.
How It Works
- Build a Vocabulary: Extract all unique words from your dataset.
- Assign an Index: Give each word an integer ID (e.g., Apple=0, Banana=1, Cherry=2).
- Create Vectors: Create an array of zeros equal to the vocabulary size. Set the value at the word's index to 1.
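The three steps above can be sketched from scratch in a few lines of plain Python (the corpus and helper name here are illustrative, not part of any library):

```python
# A minimal from-scratch sketch of the three steps: build vocabulary,
# assign indices, create one-hot vectors.

corpus = ["apple banana", "banana cherry"]

# 1. Build a vocabulary of unique words.
vocab = sorted({word for doc in corpus for word in doc.split()})

# 2. Assign each word an integer index.
word_to_index = {word: i for i, word in enumerate(vocab)}
# {'apple': 0, 'banana': 1, 'cherry': 2}

# 3. Create a zero vector of vocabulary size, then set the word's index to 1.
def one_hot(word):
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("banana"))  # [0, 1, 0]
```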
Example Vectorization
Suppose our entire vocabulary consists of exactly 4 words: ["cat", "dog", "barks", "meows"]
| Word | cat (Idx 0) | dog (Idx 1) | barks (Idx 2) | meows (Idx 3) | Vector Representation |
|---|---|---|---|---|---|
| cat | 1 | 0 | 0 | 0 | [1, 0, 0, 0] |
| dog | 0 | 1 | 0 | 0 | [0, 1, 0, 0] |
| barks | 0 | 0 | 1 | 0 | [0, 0, 1, 0] |
| meows | 0 | 0 | 0 | 1 | [0, 0, 0, 1] |
Pros
- Extremely easy to understand and implement.
- Requires no parameter tuning or statistical training.
- Works well for very small vocabularies.
Cons (Why it fails)
- Sparsity: If your language has 50,000 words, every vector is 49,999 zeros and one '1'. This wastes massive memory!
- No Semantics: "Cat" and "Dog" are mathematically completely perpendicular (orthogonal). Their dot product is 0. The model doesn't know they are both pets.
- No Context: You lose the sequential order of words.
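The sparsity and no-semantics problems are easy to demonstrate numerically; this short NumPy sketch (array names are illustrative) shows both:

```python
import numpy as np

# "cat" and "dog" as one-hot vectors over a 4-word vocabulary.
cat = np.array([1, 0, 0, 0])
dog = np.array([0, 1, 0, 0])

# The dot product of any two distinct one-hot vectors is always 0,
# so the representation encodes no similarity between related words.
print(np.dot(cat, dog))  # 0

# Sparsity: a 50,000-word vocabulary means 50,000-dimensional vectors,
# each with exactly one non-zero entry.
vocab_size = 50_000
vector = np.zeros(vocab_size)
vector[123] = 1
print(int(np.count_nonzero(vector)), "non-zero out of", vocab_size)
```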
Python Implementation using scikit-learn
```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# 1. Define our vocabulary documents
words = np.array([['cat'], ['dog'], ['barks'], ['meows']])

# 2. Initialize the encoder (sparse_output=False returns a dense array
#    for readability; requires scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)

# 3. Fit and transform the words
one_hot_vectors = encoder.fit_transform(words)

# Display the categories and their vectors.
# Note: scikit-learn sorts categories alphabetically, so the indices
# differ from the hand-assigned example in the table above.
print("Vocabulary Categories:", encoder.categories_[0])
print("\nVectors:")
for i, word in enumerate(words):
    print(f"{word[0]:<6} -> {one_hot_vectors[i]}")

'''
Output:
Vocabulary Categories: ['barks' 'cat' 'dog' 'meows']

Vectors:
cat    -> [0. 1. 0. 0.]
dog    -> [0. 0. 1. 0.]
barks  -> [1. 0. 0. 0.]
meows  -> [0. 0. 0. 1.]
'''
```