One-Hot Encoding
The simplest method for representing textual data numerically, using sparse binary vectors.
One-Hot Encoding in NLP
One-Hot Encoding is the most fundamental way to convert categorical data, such as strings of words, into numerical vectors that machine learning algorithms can understand. It represents every word in a vocabulary as a binary vector with all 0s and a single 1.
How It Works
- Build a Vocabulary: Extract all unique words from your dataset.
- Assign an Index: Give each word an integer ID (e.g., Apple=0, Banana=1, Cherry=2).
- Create Vectors: Create an array of zeros equal to the vocabulary size. Set the value at the word's index to 1.
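The three steps above can be sketched from scratch in a few lines of plain Python (the corpus and helper name here are illustrative, not part of any library):

```python
# A minimal from-scratch sketch of the three steps: build vocabulary,
# assign indices, create one-hot vectors.

corpus = ["apple banana", "banana cherry"]

# 1. Build a vocabulary of unique words.
vocab = sorted({word for doc in corpus for word in doc.split()})

# 2. Assign each word an integer index.
word_to_index = {word: i for i, word in enumerate(vocab)}
# {'apple': 0, 'banana': 1, 'cherry': 2}

# 3. Create a zero vector of vocabulary size, then set the word's index to 1.
def one_hot(word):
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("banana"))  # [0, 1, 0]
```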
Example Vectorization
Suppose our entire vocabulary consists of exactly 4 words: ["cat", "dog", "barks", "meows"]
| Word | cat (Idx 0) | dog (Idx 1) | barks (Idx 2) | meows (Idx 3) | Vector Representation |
|---|---|---|---|---|---|
| cat | 1 | 0 | 0 | 0 | [1, 0, 0, 0] |
| dog | 0 | 1 | 0 | 0 | [0, 1, 0, 0] |
| barks | 0 | 0 | 1 | 0 | [0, 0, 1, 0] |
| meows | 0 | 0 | 0 | 1 | [0, 0, 0, 1] |
Pros
- Extremely easy to understand and implement.
- Requires no parameter tuning or statistical training.
- Works well for very small vocabularies.
Cons (Why it fails)
- Sparsity: If your language has 50,000 words, every vector is 49,999 zeros and one '1'. This wastes massive memory!
- No Semantics: "Cat" and "Dog" are mathematically completely perpendicular (orthogonal). Their dot product is 0. The model doesn't know they are both pets.
- No Context: You lose the sequential order of words.
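The sparsity and no-semantics problems are easy to demonstrate numerically; this short NumPy sketch (array names are illustrative) shows both:

```python
import numpy as np

# "cat" and "dog" as one-hot vectors over a 4-word vocabulary.
cat = np.array([1, 0, 0, 0])
dog = np.array([0, 1, 0, 0])

# The dot product of any two distinct one-hot vectors is always 0,
# so the representation encodes no similarity between related words.
print(np.dot(cat, dog))  # 0

# Sparsity: a 50,000-word vocabulary means 50,000-dimensional vectors,
# each with exactly one non-zero entry.
vocab_size = 50_000
vector = np.zeros(vocab_size)
vector[123] = 1
print(int(np.count_nonzero(vector)), "non-zero out of", vocab_size)
```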
Python Implementation using scikit-learn
```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# 1. Define our vocabulary documents
words = np.array([['cat'], ['dog'], ['barks'], ['meows']])

# 2. Initialize the encoder (sparse_output=False returns a dense array
#    for readability; requires scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)

# 3. Fit and transform the words
one_hot_vectors = encoder.fit_transform(words)

# Display the categories and their vectors.
# Note: scikit-learn sorts categories alphabetically, so the indices
# differ from the hand-assigned example in the table above.
print("Vocabulary Categories:", encoder.categories_[0])
print("\nVectors:")
for i, word in enumerate(words):
    print(f"{word[0]:<6} -> {one_hot_vectors[i]}")

'''
Output:
Vocabulary Categories: ['barks' 'cat' 'dog' 'meows']

Vectors:
cat    -> [0. 1. 0. 0.]
dog    -> [0. 0. 1. 0.]
barks  -> [1. 0. 0. 0.]
meows  -> [0. 0. 0. 1.]
'''
```