Word Embeddings
The transition from sparse, high-dimensional vectors to dense, continuous, low-dimensional vector spaces capable of capturing complex meaning.
Introduction to Word Embeddings
We've looked at One-Hot, BoW, and TF-IDF encoding. All of these generate Sparse Vectors (mostly zeros) whose length equals the size of the vocabulary (often 50k+ dimensions). Word Embeddings represented a paradigm shift in 2013: migrating from Sparse Vectors to Dense Vectors.
Sparse Vector (One-Hot)
"King" = [0, 0, 1, 0, 0, 0, 0, 0, 0....]
"Man" = [0, 0, 0, 0, 0, 1, 0, 0, 0....]
Dense Vector (Embedding)
"King" = [0.98, 0.45, -0.6, 0.12, 0.8]
"Man" = [0.93, 0.41, -0.9, 0.15, 0.3]
How Dense Embeddings Work
Rather than counting words, an embedding model uses a Neural Network to map each word into a continuous geometric space. Each dimension (number) in the fixed-length vector captures a latent semantic feature (e.g., gender, royalty, color, sentiment).
- Because every dimension holds a continuous value (a float rather than mostly zeros), a vector of typically 50–300 dimensions can compress the context of a vast vocabulary.
- Cosine Similarity (the cosine of the angle between two vectors) measures how conceptually similar two words are.
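Using the toy 5-dimensional vectors above (illustrative values, not from a trained model), cosine similarity can be sketched in a few lines of plain Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king = [0.98, 0.45, -0.6, 0.12, 0.8]
man = [0.93, 0.41, -0.9, 0.15, 0.3]

print(cosine_similarity(king, man))  # high: the dense vectors point in similar directions

# One-hot vectors for two different words are always orthogonal,
# so their cosine similarity is always 0 — no notion of "related words":
print(cosine_similarity([0, 0, 1, 0], [0, 1, 0, 0]))  # 0.0
```

This is exactly why dense vectors are useful: similarity between one-hot vectors is degenerate (0 for any two distinct words), while dense vectors place related words at small angles to each other.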
The State of the Art: The "Big 3" Static Embeddings
1. Word2Vec (2013)
Developed by Google
A predictive model that uses a shallow Neural Network to guess a word from its neighbors (the CBOW architecture) or the neighbors from the word (Skip-gram).
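The heart of the Skip-gram objective is the set of (target, context) pairs drawn from a sliding window over the corpus. A minimal sketch of how those training pairs are generated (the function name and window size are illustrative, not from any library):

```python
def skipgram_pairs(tokens, window=2):
    """Generate the (target, context) pairs a Skip-gram model trains on.

    For each position, every token within `window` positions on either
    side is treated as a context word for the target token.
    """
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the king wears the crown".split()
print(skipgram_pairs(sentence, window=1))
```

The network then learns vectors such that the target's vector predicts its context words well; words sharing many contexts end up with similar vectors.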
2. GloVe (2014)
Developed by Stanford
A count-based model that performs matrix factorization on a gigantic global word Co-occurrence Matrix to derive vectors.
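The global Co-occurrence Matrix GloVe factorizes is just a table of how often each word pair appears near each other across the whole corpus. A minimal sketch of building those counts (function name and window size are illustrative; real GloVe also applies distance weighting):

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often word pairs appear within `window` tokens of each other."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = [["the", "king", "rules"], ["the", "queen", "rules"]]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("the", "rules")])  # co-occurs once in each sentence
```

GloVe then factorizes this matrix so that the dot product of two word vectors approximates the log of their co-occurrence count.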
3. FastText (2016)
Developed by Facebook AI
An extension of Word2Vec that trains on sub-word character N-grams (e.g., "apple" = "app", "ppl", "ple"). Because a word's vector is built from its pieces, it can handle out-of-vocabulary words and spelling errors!