Word Embeddings

Introduction to Word Embeddings

We've looked at One-Hot, BoW, and TF-IDF encoding. All of these generate Sparse Vectors (mostly zeros) whose length equals the size of the vocabulary, often 50,000 dimensions or more. Word Embeddings, popularized by Word2Vec in 2013, represented a paradigm shift: migrating from Sparse Vectors to Dense Vectors.

Sparse Vector (One-Hot)

"King" = [0, 0, 1, 0, 0, 0, 0, 0, 0....]

"Man" = [0, 0, 0, 0, 0, 1, 0, 0, 0....]

Length: 50,000+

Similarity: 0.0 (No overlap)

Dense Vector (Embedding)

"King" = [0.98, 0.45, -0.6, 0.12, 0.8]

"Man" = [0.93, 0.41, -0.9, 0.15, 0.3]

Length: Fixed size (e.g., 300; only the first 5 values are shown here)

Similarity: High (Vectors point same way)

How Dense Embeddings Work

Rather than counting words, an embedding model uses a Neural Network to map words into a continuous geometric space. The dimensions of the fixed-length vector jointly capture latent semantic features (e.g., gender, royalty, color, sentiment), although a single dimension rarely corresponds to one human-readable feature.

Because the vectors are dense (a real-valued float in every position instead of mostly zeros), they compress a word's typical contexts into a few hundred dimensions (e.g., 300).

Cosine Similarity, which compares the angle between two vectors, gives a useful measure of how conceptually similar two words are.
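
To make the sparse vs. dense comparison concrete, here is a minimal sketch that computes Cosine Similarity by hand. The dense vectors are the illustrative 5-value examples from the comparison above, not trained embeddings, and the helper name `cosine_similarity` is just a label chosen for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors over a toy 9-word vocabulary: the 1s never line up
king_sparse = [0, 0, 1, 0, 0, 0, 0, 0, 0]
man_sparse = [0, 0, 0, 0, 0, 1, 0, 0, 0]

# Illustrative dense vectors (made-up numbers, not from a trained model)
king_dense = [0.98, 0.45, -0.6, 0.12, 0.8]
man_dense = [0.93, 0.41, -0.9, 0.15, 0.3]

sparse_sim = cosine_similarity(king_sparse, man_sparse)
dense_sim = cosine_similarity(king_dense, man_dense)
print(f"sparse: {sparse_sim:.2f}")
print(f"dense:  {dense_sim:.2f}")
```

Two different one-hot vectors always score exactly 0, while the dense pair scores roughly 0.92 because the vectors point in nearly the same direction.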

The State of the Art: The "Big 3" Static Embeddings

1. Word2Vec (2013)

Developed by Google

A predictive model that uses a shallow Neural Network to guess words based on their neighbors (or vice versa).

2. GloVe (2014)

Developed by Stanford

A count-based model that performs matrix factorization on a gigantic global word Co-occurrence Matrix to derive vectors.

3. FastText (2016)

Developed by Facebook AI

An extension of Word2Vec that trains on sub-word character N-grams (e.g., "apple" includes "app", "ppl", "ple"). It can build vectors for misspelled and previously unseen words!

Note: Since 2018, static embeddings have largely been superseded by Contextual Embeddings from models like BERT and other LLMs, though they remain useful for lightweight tasks.

Word2Vec

Word2Vec Formulations

Introduced by Tomas Mikolov and colleagues at Google in 2013, Word2Vec is not a single algorithm, but a shallow 2-layer Neural Network framework with two distinct architectural flavors: CBOW and Skip-Gram. The secret? It learns embeddings as a byproduct of a "fake" prediction task: the trained weights are kept as the word vectors, and the predictions themselves are thrown away.

1. Continuous Bag of Words (CBOW)

Predicts the Target word from the Context.

Given the context: "The fox ___ over the dog"

Neural Net Predicts: "jumps"

CBOW is several times faster to train than Skip-Gram and has slightly better accuracy for frequent words.

2. Skip-Gram

Predicts Context words from a given Target word.

Given the target: "___ ___ jumps ___ ___"

Neural Net Predicts: "the", "fox", "over", "the" (one prediction per position in the context window)

Skip-Gram is slower, but it works well with small amounts of training data and represents rare words well.
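
The difference between the two flavors is easiest to see in the training examples each one generates. The sketch below builds them for the fox sentence with a window of 2. The function names are hypothetical, and real Word2Vec implementations add details omitted here (for example, randomly shrinking the window and subsampling frequent words).

```python
def skipgram_pairs(tokens, window):
    """(target, context) pairs: each target predicts each of its neighbors."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """(context_words, target) examples: the neighbors jointly predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        examples.append((context, target))
    return examples

tokens = "the fox jumps over the dog".split()
sg = skipgram_pairs(tokens, window=2)
cbow = cbow_examples(tokens, window=2)

# Skip-Gram: one training pair per (target, neighbor)
print([pair for pair in sg if pair[0] == "jumps"])
# CBOW: one training example per target, with all neighbors as input
print(cbow[2])
```

Skip-Gram produces several pairs per word where CBOW produces a single example, which is one intuition for why Skip-Gram trains more slowly but gives rare words more updates.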

The Magic of Vector Math

Word2Vec famously showed that its continuous vector space captures many relational analogies through simple vector addition and subtraction. The arithmetic rarely lands exactly on the answer; instead, the answer is the word whose vector is nearest to the result.

King - Man + Woman ≈ Queen

Paris - France + Italy ≈ Rome
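
The analogy arithmetic can be sketched with a toy example. The 2-D vectors below are hand-built for illustration (axis 0 loosely means "royalty", axis 1 "gender"); they are assumptions, not output from a trained model, and `analogy` is a hypothetical helper name.

```python
import math

# Hand-built toy vectors (hypothetical): [royalty, gender]
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

With a trained Gensim model, the equivalent query is `model.wv.most_similar(positive=["king", "woman"], negative=["man"])`, which likewise excludes the input words from the candidates.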

Training Word2Vec using Python Gensim

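import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data (one-time download).
# Newer NLTK releases look for "punkt_tab", older ones for "punkt".
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
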
# Sample corpus (should be massive in reality!)
corpus = [
    "Machine learning is the study of computer algorithms.",
    "Deep learning is a subset of machine learning using neural networks.",
    "Artificial intelligence creates smart machines."
]

# Tokenize sentences into lists of words
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

# 1. Train Word2Vec Model
# vector_size: dimension of embedding (usually 100-300)
# window: context window size
# sg: 1 for Skip-Gram, 0 for CBOW
model = Word2Vec(sentences=tokenized_corpus, vector_size=10, window=3, min_count=1, sg=1)

# 2. Extract the learned vector for a word
print("Vector for 'learning':")
print(model.wv['learning'])

# 3. Find the nearest words by Cosine Similarity
# (on a corpus this tiny the neighbors are essentially random)
print("\nMost similar to 'learning':")
print(model.wv.most_similar('learning', topn=3))

GloVe Vectors

GloVe (Global Vectors)

Created by Stanford researchers in 2014, GloVe (Global Vectors for Word Representation) sought to combine the best of both previous NLP worlds. It argues that while predictive models like Word2Vec are great, they only look at a small window of local context. They miss out on the global statistics of the entire document corpus.

Combining Two Philosophies

GloVe bridges two different text representation paradigms:

Matrix Factorization (Global): Methods like LSA factorize a giant Co-occurrence Matrix. They capture global corpus statistics well but perform poorly on analogies.

Local Context Window (Predictive): Methods like Word2Vec learn excellent analogy math but make poor use of global corpus statistics.

GloVe trains on the non-zero entries of a global word-word co-occurrence matrix, rather than on the entire sparse matrix or just separate local windows, optimizing a log-bilinear regression model.
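
For reference, the log-bilinear regression objective can be written out as below. This follows the formulation in the GloVe paper as best recalled here, so treat the notation as a sketch: X_ij is the co-occurrence count of words i and j, w and w-tilde are the word and context vectors, b and b-tilde are biases, and V is the vocabulary size.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

Because f(0) = 0, zero entries of the matrix contribute nothing to the loss, which is why training only needs to visit the non-zero entries. The paper's reported settings are x_max = 100 and alpha = 3/4.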

The Ratio of Probabilities

The mathematical genius of GloVe lies in looking at ratios of co-occurrence probabilities rather than pure probabilities.

Let's look at the words Ice and Steam. We want to see how they interact with probe words like Solid and Gas.

| Probe word k | Solid | Gas | Water (shares both) | Fashion (irrelevant) |
|---|---|---|---|---|
| `P(k \| Ice)` | High (~1.9x10^-4) | Low (~6.6x10^-5) | High | Low |
| `P(k \| Steam)` | Low (~2.2x10^-5) | High (~7.8x10^-4) | High | Low |
| Ratio: `P(k \| Ice) / P(k \| Steam)` | Large (8.9) | Small (0.085) | Neutral (~1.0) | Neutral (~1.0) |

Takeaway: The ratio clearly discriminates the relevant thermodynamic properties (Solid vs Gas) while canceling out words that appear frequently with both (Water) or rarely with both (Fashion). GloVe forces the model to learn these ratios to encode meaning.
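
The ratios can be checked with a few lines of arithmetic. This sketch uses only the rounded Solid and Gas probabilities from the table above; because the inputs are rounded, the Solid ratio comes out near 8.6 rather than the 8.9 in the table, which was presumably computed from unrounded counts.

```python
import math

# Rounded co-occurrence probabilities taken from the table above
p_given_ice = {"solid": 1.9e-4, "gas": 6.6e-5}
p_given_steam = {"solid": 2.2e-5, "gas": 7.8e-4}

ratios = {k: p_given_ice[k] / p_given_steam[k] for k in p_given_ice}

for probe, ratio in ratios.items():
    # A positive log-ratio means the probe favors "ice", a negative one "steam"
    print(f"{probe}: ratio = {ratio:.3g}, log-ratio = {math.log(ratio):+.2f}")
```

The sign of the log-ratio is the useful signal: a probe word related to ice gives a positive value, one related to steam gives a negative value, and a word related to both or neither lands near zero.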

Loading Pre-Trained GloVe Vectors (using spaCy)

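# Rather than training GloVe from scratch (which takes massive resources),
# the common practice is to download pre-trained vectors.
# spaCy's 'en_core_web_md' and 'en_core_web_lg' ship with 300-D static word vectors.
# Older spaCy releases document them as GloVe vectors trained on Common Crawl;
# check the model card of your installed version for the exact source.
# Install the model first with: python -m spacy download en_core_web_md

import spacy

# Load the medium model (about 20k unique vectors of 300 dimensions)
nlp = spacy.load("en_core_web_md")

# Each call returns a Doc; for a one-word text, doc.vector is that word's vector
word1 = nlp("king")
word2 = nlp("queen")
word3 = nlp("apple")

# Extract the dense 300-D numpy array
print(f"Vector shape for 'king': {word1.vector.shape}")
# >>> (300,)

# spaCy's similarity() computes Cosine Similarity on these vectors
print(f"Similarity (King vs Queen): {word1.similarity(word2):.3f}")
# >>> 0.725 (example output; the exact value depends on the model version)

print(f"Similarity (King vs Apple): {word1.similarity(word3):.3f}")
# >>> 0.204 (example output; the exact value depends on the model version)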

FastText

FastText: Subword Embeddings

Created by Facebook's AI Research (FAIR) lab in 2016, FastText is an extension of the Word2Vec model. While Word2Vec and GloVe treat every word as a distinct, atomic entity, FastText breaks words down into smaller pieces called character n-grams.

The Subword Breakdown Example

How does FastText view the word "apple" using an n-gram size of n=3?

FastText adds special boundary characters < and > to denote the beginning and end of a word.

Word: <apple>
N-grams (n=3): [ "<ap", "app", "ppl", "ple", "le>" ]

The final embedding for "apple" is the sum of the embeddings of all these little n-grams (plus the embedding for the whole word itself)!
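
The breakdown above can be reproduced with a short helper. This is a minimal sketch for a single n-gram size; `char_ngrams` is a hypothetical name, and the real FastText implementation extracts a range of sizes and hashes the n-grams into a fixed number of buckets.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

apple = char_ngrams("apple")
typo = char_ngrams("appple")
shared = set(apple) & set(typo)

print(apple)
print(typo)
print(f"{len(shared)} of {len(set(typo))} n-grams of the typo also occur in 'apple'")
```

Since the misspelling shares almost all of its n-grams with the correct word, summing the n-gram vectors yields a vector close to the one for "apple".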

Why is this Revolutionary?
Handles Typos: If a user types "appple", Word2Vec has no vector for it because it has never seen that word, so the lookup fails. FastText still produces a sensible vector because "appple" shares five of its six trigrams with "apple".
Solves the OOV Problem: It can generate embeddings for Out-Of-Vocabulary (OOV) words it has never seen before, by summing the vectors of their character n-grams.
Great for Morphologically Rich Languages: Highly effective for languages like Turkish or Finnish where words are formed by gluing together many suffixes.

Gensim FastText Implementation

from gensim.models import FastText

corpus = [["hello", "world", "this", "is", "nlp"], 
          ["machine", "learning", "is", "awesome"]]

# Train FastText
# min_n and max_n control the character n-gram sizes
model = FastText(sentences=corpus, vector_size=10, 
                 window=3, min_count=1, min_n=3, max_n=6)

# The model has never seen "learnings", but it can still build a
# vector from the character n-grams it shares with "learning"
# (e.g., "lea", "earn", "ning").
oov_word = "learnings"

# This lookup works for FastText; a plain Word2Vec model would raise a KeyError.
vector = model.wv[oov_word]
print(f"Vector for {oov_word} generated successfully!")

ELMo Embeddings

ELMo: Contextual Embeddings

ELMo (Embeddings from Language Models), introduced in 2018 by researchers at the Allen Institute for AI (the group behind AllenNLP) and the University of Washington, marked a turning point in NLP history: the shift from entirely Static Embeddings (Word2Vec/GloVe) to fully Contextual Embeddings.

The Core Problem with Static Embeddings: Polysemy
In Word2Vec, the word "bank" has exactly one vector. Whether it's "river bank" or "savings bank", Word2Vec outputs the exact same numbers. This is a real limitation because the meaning is entirely context-dependent!

How ELMo Solves This

ELMo does not use a fixed dictionary lookup. Instead, ELMo calculates the embedding for a word on-the-fly by looking at the entire sentence it lives in.

Contextual Output

Sentence A

"He deposited money in the bank."

Vector: [0.81, 0.22, -0.4...]

Sentence B

"He sat by the river bank."

Vector: [0.12, -0.99, 0.3...]

Different contexts = different mathematical vectors for the exact same word!

The Architecture: Bi-Directional LSTM

ELMo uses a deep, 2-layer Bi-Directional LSTM (Long Short-Term Memory) neural network, fed by character-based word representations and trained on a standard Language Modeling task: a forward model predicts the next word and a backward model predicts the previous word.

Forward LSTM: Reads the sentence from left-to-right to capture past context.
Backward LSTM: Reads the sentence from right-to-left to capture future context.

The final embedding is a weighted sum of the internal layer states obtained from these Bi-LSTMs, with the weights learned for each downstream task. ELMo paved the way for BERT and other Transformer-based contextual models.
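
The weighted sum can be sketched as follows. The layer states are made-up 4-D numbers for a single token (one input layer plus two Bi-LSTM layers), and `elmo_combine` is a hypothetical helper. In ELMo the per-layer scores and the scale factor gamma are learned for each downstream task; here they are fixed by hand.

```python
import math

def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_combine(layer_states, layer_scores, gamma=1.0):
    """gamma * sum over layers of softmax(scores)[j] * layer_states[j]."""
    weights = softmax(layer_scores)
    dim = len(layer_states[0])
    return [
        gamma * sum(w * layer[d] for w, layer in zip(weights, layer_states))
        for d in range(dim)
    ]

# Hypothetical states for one token: input layer, Bi-LSTM layer 1, Bi-LSTM layer 2
layers = [
    [1.0, 0.0, 2.0, -1.0],
    [0.0, 3.0, 2.0, 1.0],
    [2.0, 0.0, 2.0, 3.0],
]

# Equal scores give every layer the same weight, so the result is the layer average
combined = elmo_combine(layers, layer_scores=[0.0, 0.0, 0.0])
print(combined)
```

Raising one layer's score shifts the combined vector toward that layer, which is how a downstream task can favor whichever layer suits it best.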