One-Hot Encoding

The simplest method for representing textual data numerically, using sparse binary vectors.

One-Hot Encoding in NLP

One-Hot Encoding is the most fundamental way to convert categorical data, such as words, into numerical vectors that machine learning algorithms can process. It represents every word in a vocabulary as a binary vector that is all 0s except for a single 1 at the word's index.

How It Works

  1. Build a Vocabulary: Extract all unique words from your dataset.
  2. Assign an Index: Give each word an integer ID (e.g., Apple=0, Banana=1, Cherry=2).
  3. Create Vectors: For each word, create an array of zeros with length equal to the vocabulary size, then set the value at the word's index to 1.
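The three steps above can be sketched in plain Python (a minimal illustration, not a production implementation; the vocabulary is sorted here so that indices are stable and reproducible):

```python
words = ["cat", "dog", "barks", "meows"]

# 1. Build a vocabulary of unique words (sorted for a stable index order)
vocab = sorted(set(words))

# 2. Assign each word an integer index
word_to_index = {word: i for i, word in enumerate(vocab)}

# 3. Create a zero vector and set the word's index to 1
def one_hot(word):
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("cat"))  # [0, 1, 0, 0]  ("barks" sorts first, so "cat" gets index 1)
```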

Example Vectorization

Suppose our entire vocabulary consists of exactly 4 words: ["cat", "dog", "barks", "meows"]

Word    cat (Idx 0)   dog (Idx 1)   barks (Idx 2)   meows (Idx 3)   Vector Representation
cat          1             0              0               0          [1, 0, 0, 0]
dog          0             1              0               0          [0, 1, 0, 0]
barks        0             0              1               0          [0, 0, 1, 0]
meows        0             0              0               1          [0, 0, 0, 1]

Pros

  • Extremely easy to understand and implement.
  • Requires no parameter tuning or statistical training.
  • Works well for very small vocabularies.

Cons (Why it fails)

  • Sparsity: With a 50,000-word vocabulary, every vector has 50,000 dimensions, of which 49,999 are zeros. This wastes enormous amounts of memory.
  • No Semantics: "Cat" and "Dog" are mathematically orthogonal; their dot product is 0. The model has no way to know they are both pets.
  • No Context: You lose the sequential order of words.
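The "No Semantics" problem is easy to demonstrate: the dot product between any two distinct one-hot vectors is always 0, so no pair of words is ever more similar than any other (a small sketch using NumPy):

```python
import numpy as np

# One-hot vectors for two semantically related words
cat = np.array([1, 0, 0, 0])
dog = np.array([0, 1, 0, 0])

# The dot product (and hence cosine similarity) is 0 for ANY
# pair of distinct one-hot vectors, related or not
print(np.dot(cat, dog))  # 0
```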

Python Implementation using Scikit-Learn

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# 1. Define the vocabulary words (each wrapped in a list, since the encoder expects 2D input)
words = np.array([['cat'], ['dog'], ['barks'], ['meows']])

# 2. Initialize the encoder (sparse_output=False returns a dense array for readability)
encoder = OneHotEncoder(sparse_output=False)

# 3. Fit and transform the words
one_hot_vectors = encoder.fit_transform(words)

# Display the categories and their vectors
print("Vocabulary Categories:", encoder.categories_[0])
print("\nVectors:")
for i, word in enumerate(words):
    print(f"{word[0]:<6} -> {one_hot_vectors[i]}")

'''
Output:
Vocabulary Categories: ['barks' 'cat' 'dog' 'meows']

Vectors:
cat    -> [0. 1. 0. 0.]
dog    -> [0. 0. 1. 0.]
barks  -> [1. 0. 0. 0.]
meows  -> [0. 0. 0. 1.]
'''