Word Embeddings
Dense vector representations: Word2Vec, GloVe, FastText, and ELMo.
Introduction to Word Embeddings
We've looked at One-Hot, BoW, and TF-IDF encoding. All of these generate Sparse Vectors (mostly zeros) whose length equals the size of the vocabulary (often 50k+ dimensions). Word Embeddings, popularized by Word2Vec in 2013, represented a paradigm shift: migrating from Sparse Vectors to Dense Vectors.
Sparse Vector (One-Hot)
"King" = [0, 0, 1, 0, 0, 0, 0, 0, 0....]
"Man" = [0, 0, 0, 0, 0, 1, 0, 0, 0....]
Dense Vector (Embedding)
"King" = [0.98, 0.45, -0.6, 0.12, 0.8]
"Man" = [0.93, 0.41, -0.9, 0.15, 0.3]
How Dense Embeddings Work
Rather than counting words, an embedding model uses a neural network to map words into a continuous geometric space. Together, the dimensions of the fixed-length vector capture latent semantic features (e.g., gender, royalty, color, sentiment), although a single dimension rarely corresponds cleanly to one human-readable feature.
- Because the vectors are dense (real-valued floats instead of mostly zeros), they compress vocabulary context into a few hundred dimensions (300 is a common choice).
- Cosine Similarity, which depends only on the angle between two vectors, gives a useful measure of how conceptually similar two words are.
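As a minimal sketch of the cosine similarity mentioned above, the snippet below applies the standard formula to the illustrative "King" and "Man" vectors from the dense example. The helper name `cosine_similarity` is an assumption made for this example, and the vectors are the toy values shown earlier, not trained embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 5-dimensional vectors from the example above (not trained values)
king = [0.98, 0.45, -0.6, 0.12, 0.8]
man = [0.93, 0.41, -0.9, 0.15, 0.3]

similarity = cosine_similarity(king, man)
print(f"cosine(King, Man) = {similarity:.3f}")
```

A value close to 1 means the vectors point in nearly the same direction, a value near 0 means they are unrelated, and negative values mean they point in opposing directions.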
The Classics: The "Big 3" Static Embeddings
1. Word2Vec (2013)
Developed by Google
A predictive model that uses a shallow Neural Network to guess words based on their neighbors (or vice versa).
2. GloVe (2014)
Developed by Stanford
A count-based model that performs matrix factorization on a gigantic global word Co-occurrence Matrix to derive vectors.
3. FastText (2016)
Developed by Facebook AI
An extension of Word2Vec that trains on sub-word character N-grams (e.g., with boundary markers, "apple" = "<ap", "app", "ppl", "ple", "le>"). It can build vectors for misspelled and previously unseen words!
Word2Vec
Word2Vec Formulations
Introduced by Tomas Mikolov and colleagues at Google in 2013, Word2Vec is not a single algorithm, but a shallow Neural Network framework containing two distinct architectural flavors: CBOW and Skip-Gram. The secret? It learns embeddings as a byproduct of a "fake" prediction task: the predictions themselves are thrown away, and the learned weights are kept as the word vectors.
1. Continuous Bag of Words (CBOW)
Predicts the Target word from the Context.
"The fox ___ over the dog" → Neural Net Predicts: "jumps"
CBOW is several times faster to train than Skip-Gram and has slightly better accuracy for frequent words.
2. Skip-Gram
Predicts Context words from a given Target word.
"___ ___ jumps ___ ___" → Neural Net Predicts: "the", "fox", "over", "the"
Skip-Gram is slower, but it works extremely well with small amounts of training data and handles rare words exceptionally well.
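To make the two formulations concrete, here is a minimal sketch of how training examples could be generated from one sentence. The function names `skipgram_pairs` and `cbow_examples` and the window size of 2 are assumptions for illustration. Real implementations add refinements such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: one (target, context_word) pair per neighbor in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, stop) if j != i]
        examples.append((context, target))
    return examples

tokens = "the fox jumps over the dog".split()
sg = skipgram_pairs(tokens)
cbow = cbow_examples(tokens)

print(cbow[2])                               # (['the', 'fox', 'over', 'the'], 'jumps')
print([p for p in sg if p[0] == "jumps"])    # four pairs, one per context word
```

Note that Skip-Gram produces several training pairs per position while CBOW produces one, which is one intuition for why CBOW trains faster.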
The Magic of Vector Math
Word2Vec famously showed that its continuous vector space captures many relational analogies through simple vector addition and subtraction, for example:
King - Man + Woman ≈ Queen
Paris - France + Italy ≈ Rome
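The analogy arithmetic (for example king - man + woman landing near queen) can be sketched with a handful of hand-made vectors. Everything here is an assumption for illustration: the four dimensions, the values, and the helper names `cosine` and `analogy` are invented, and real embeddings are learned rather than written by hand.

```python
import numpy as np

# Hand-made toy vectors; dimensions are [royalty, male, female, fruit].
# Hypothetical values chosen for illustration, not learned from data.
vectors = {
    "king":  np.array([0.9, 0.9, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.9, 0.0]),
    "man":   np.array([0.1, 0.9, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
print(result)  # queen
```

Excluding the three input words from the candidates mirrors common practice, since the nearest vector to the result is often one of the inputs themselves.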
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
# word_tokenize needs NLTK's tokenizer data; download it once with
# nltk.download("punkt") (newer NLTK releases may ask for "punkt_tab" instead)
# Sample corpus (should be massive in reality!)
corpus = [
    "Machine learning is the study of computer algorithms.",
    "Deep learning is a subset of machine learning using neural networks.",
    "Artificial intelligence creates smart machines."
]
# Tokenize sentences into lists of lowercase tokens
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]
# 1. Train Word2Vec Model
# vector_size: dimension of embedding (usually 100-300)
# window: context window size
# sg: 1 for Skip-Gram, 0 for CBOW
model = Word2Vec(sentences=tokenized_corpus, vector_size=10, window=3, min_count=1, sg=1)
# 2. Extract the learned vector for a word
print("Vector for 'learning':")
print(model.wv['learning'])
# 3. Find the nearest neighbors by cosine similarity
# (on a corpus this tiny, the neighbors are essentially noise)
print("\nMost similar to 'learning':")
print(model.wv.most_similar('learning', topn=3))
GloVe Vectors
GloVe (Global Vectors)
Created by Stanford researchers in 2014, GloVe (Global Vectors for Word Representation) sought to combine the best of both previous NLP worlds. Its authors argue that while predictive models like Word2Vec work well, they only look at a small window of local context at a time and so make poor use of the global statistics of the entire corpus.
Combining Two Philosophies
GloVe bridges two different text representation paradigms:
- Matrix Factorization (Global): Methods like LSA factorize a giant count matrix. They make good use of global corpus statistics but perform poorly on analogy tasks.
- Local Context Window (Predictive): Methods like Word2Vec learn from local windows. They do well on analogy tasks but do not directly use global corpus statistics.
GloVe trains on the non-zero entries of a global word-word co-occurrence matrix, rather than on the entire sparse matrix or just separate local windows, optimizing a log-bilinear regression model.
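For reference, the log-bilinear regression described above is usually written as a weighted least-squares objective. The notation below follows my recollection of the GloVe paper and is not taken from this tutorial: X_ij is the number of times word j appears in the context of word i, w_i and w̃_j are the word and context vectors, b_i and b̃_j are bias terms, and f is a weighting function that caps the influence of very frequent pairs (the paper reports x_max = 100 and α = 3/4).

```latex
J = \sum_{i,j \,:\, X_{ij} > 0} f(X_{ij})
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

The sum runs only over pairs with a non-zero count, which is why training touches just the non-zero entries of the co-occurrence matrix.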
The Ratio of Probabilities
The mathematical genius of GloVe lies in looking at ratios of co-occurrence probabilities rather than pure probabilities.
Let's look at the words Ice and Steam. We want to see how they interact with probe words like Solid and Gas.
| Probability and Ratio | k = Solid | k = Gas | k = Water (shares both) | k = Fashion (irrelevant) |
|---|---|---|---|---|
| P(k \| Ice) | High (~1.9×10⁻⁴) | Low (~6.6×10⁻⁵) | High | Low |
| P(k \| Steam) | Low (~2.2×10⁻⁵) | High (~7.8×10⁻⁴) | High | Low |
| Ratio: P(k \| Ice) / P(k \| Steam) | Large (8.9) | Small (0.08) | Neutral (~1.0) | Neutral (~1.0) |
Takeaway: The ratio clearly discriminates the relevant thermodynamic properties (Solid vs Gas) while canceling out words that appear frequently with both (Water) or rarely with both (Fashion). GloVe's training objective is built so that differences between word vectors reflect these ratios, which is how it encodes meaning.
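The ratio arithmetic can be checked directly from the rounded probabilities in the table above. This is only a sketch: the dictionary names are made up for the example, and only the two probes with explicit numbers (Solid and Gas) are included.

```python
# Co-occurrence probabilities copied from the table above (rounded values)
p_given_ice = {"solid": 1.9e-4, "gas": 6.6e-5}
p_given_steam = {"solid": 2.2e-5, "gas": 7.8e-4}

# Ratio P(k | ice) / P(k | steam) for each probe word k
ratios = {k: p_given_ice[k] / p_given_steam[k] for k in p_given_ice}

for probe, ratio in ratios.items():
    side = "ice" if ratio > 1 else "steam"
    print(f"{probe}: ratio = {ratio:.2f} (more associated with {side})")
```

With these rounded inputs the Solid ratio comes out near 8.6 rather than the 8.9 shown in the table, presumably because the table's figure was computed from unrounded counts. The qualitative picture is the same: far above 1 for Solid, far below 1 for Gas.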
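# Rather than training GloVe from scratch (which takes substantial resources),
# the usual practice is to download pre-trained vectors.
# spaCy's 'en_core_web_md' and 'en_core_web_lg' models ship with 300-dimensional static word vectors.
# Older spaCy releases used GloVe vectors for these models; check the model's
# release notes for the vector source in your installed version.
# Install the model first: python -m spacy download en_core_web_md
import spacy
# Load the medium model (about 20k unique vectors of 300 dimensions)
nlp = spacy.load("en_core_web_md")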
word1 = nlp("king")
word2 = nlp("queen")
word3 = nlp("apple")
# Extract the dense 300-dimensional numpy vector
print(f"Vector shape for 'king': {word1.vector.shape}")
# >>> (300,)
# similarity() computes the cosine similarity between the two vectors
print(f"Similarity (King vs Queen): {word1.similarity(word2):.3f}")
# >>> 0.725 (exact value varies with the model version)
print(f"Similarity (King vs Apple): {word1.similarity(word3):.3f}")
# >>> 0.204 (exact value varies with the model version)
FastText
FastText: Subword Embeddings
Created by Facebook's AI Research (FAIR) lab in 2016, FastText is an extension of the Word2Vec model. While Word2Vec and GloVe treat every word as a distinct, atomic entity, FastText breaks words down into smaller pieces called character n-grams.
The Subword Breakdown Example
How does FastText view the word "apple" using an n-gram size of n=3?
FastText adds special boundary characters < and > to denote the beginning and end of a word.
N-grams (n=3): [ "<ap", "app", "ppl", "ple", "le>" ]
The final embedding for "apple" is the sum of the embeddings of all these little n-grams (plus the embedding for the whole word itself)!
Why is this Revolutionary?
- Handles Typos: If a user types "appple", Word2Vec has no vector for it, because the word never appeared in training (a lookup fails, for example with a KeyError in gensim). FastText still produces a sensible vector because "appple" shares most of its character n-grams with "apple" (with n=3, five of its six trigrams).
- Solves the OOV Problem: It can generate embeddings for Out-Of-Vocabulary (OOV) words it has never seen before, by summing their character parts.
- Great for Morphologically Rich Languages: Highly effective for languages like Turkish or Finnish where words are formed by gluing together many suffixes.
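The subword breakdown and the typo argument above can be sketched in a few lines. This is a simplified illustration, not FastText's actual implementation: the helper names are invented, the n-gram vectors are random stand-ins for trained weights, only n=3 is used, and the whole-word vector and n-gram hashing used by the real library are left out.

```python
import numpy as np

def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in the boundary markers < and >."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("apple"))  # ['<ap', 'app', 'ppl', 'ple', 'le>']

# Trigrams that the typo shares with the correctly spelled word
shared = set(char_ngrams("apple")) & set(char_ngrams("appple"))
print(len(shared), "of", len(set(char_ngrams("appple"))))  # 5 of 6

rng = np.random.default_rng(0)
ngram_vectors = {}  # toy lookup table: one random vector per n-gram

def toy_vector(word, dim=100):
    """Sum the vectors of a word's n-grams, creating random ones on first use."""
    total = np.zeros(dim)
    for gram in char_ngrams(word):
        if gram not in ngram_vectors:
            ngram_vectors[gram] = rng.standard_normal(dim)
        total += ngram_vectors[gram]
    return total

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_typo = cosine(toy_vector("apple"), toy_vector("appple"))
sim_other = cosine(toy_vector("apple"), toy_vector("world"))
print(f"apple vs appple: {sim_typo:.2f}, apple vs world: {sim_other:.2f}")
```

Because the typo reuses five of the same n-gram vectors, its summed vector stays close to that of "apple", while a word with no shared n-grams ends up nearly orthogonal.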
from gensim.models import FastText
corpus = [["hello", "world", "this", "is", "nlp"],
          ["machine", "learning", "is", "awesome"]]
# Train FastText
# min_n and max_n control the character n-gram sizes
model = FastText(sentences=corpus, vector_size=10,
                 window=3, min_count=1, min_n=3, max_n=6)
# The model has never seen "learnings", but it can still build a vector
# from the word's character n-grams (e.g. "<le", "lea", "ear", "ing"),
# many of which it saw inside "learning".
oov_word = "learnings"
# A plain Word2Vec model would raise a KeyError on this lookup
vector = model.wv[oov_word]
print(f"Vector for {oov_word} generated successfully!")
ELMo Embeddings
ELMo: Contextual Embeddings
ELMo (Embeddings from Language Models), introduced in 2018 by researchers at the Allen Institute for AI and released through its AllenNLP library, marked a turning point in NLP history: the shift from Static Embeddings (Word2Vec/GloVe) to Contextual Embeddings.
In Word2Vec, the word "bank" has exactly one vector. Whether it's "river bank" or "savings bank", Word2Vec outputs the exact same numbers. This is a real limitation, because the meaning of "bank" depends on its context!
How ELMo Solves This
ELMo does not use a fixed dictionary lookup. Instead, ELMo calculates the embedding for a word on-the-fly by looking at the entire sentence it lives in.
Contextual Output
Sentence A: "He deposited money in the bank."
Sentence B: "He sat by the river bank."
Different contexts = different vectors for the exact same word!
The Architecture: Bi-Directional LSTM
ELMo uses a deep, 2-layer Bi-Directional LSTM (Long Short-Term Memory) neural network trained on a Language Modeling task: a forward language model predicts the next word, and a backward language model predicts the previous word.
- Forward LSTM: Reads the sentence from left-to-right, so each word's state summarizes the words before it (past context).
- Backward LSTM: Reads the sentence from right-to-left, so each word's state summarizes the words after it (future context).
The final embedding is a weighted sum of the internal states of these layers, with the weights learned for the downstream task. ELMo paved the way for Transformer-based contextual models such as BERT.
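The weighted sum of internal states can be sketched with NumPy. The details beyond the text above are assumptions based on my understanding of the ELMo paper: three layers are mixed (the token layer plus the two LSTM layers), the per-layer weights are softmax-normalized, and a scalar `gamma` rescales the result. The hidden states here are random stand-ins, and the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: layer 0 = token layer, layers 1-2 = the two biLSTM layers
num_layers, seq_len, dim = 3, 6, 8
# Random stand-ins for the per-layer hidden states of one 6-token sentence
layer_states = rng.standard_normal((num_layers, seq_len, dim))

layer_logits = np.array([0.2, 0.5, -0.1])  # hypothetical learned per-layer scores
gamma = 1.0                                # hypothetical learned scale factor

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

weights = softmax(layer_logits)

# Weighted sum over the layer axis gives one vector per token
elmo_vectors = gamma * np.tensordot(weights, layer_states, axes=1)
print(elmo_vectors.shape)  # (6, 8)
```

Because the hidden states depend on the whole sentence, the same word receives a different mixed vector in each sentence it appears in.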