GloVe Vectors Tutorial Section

GloVe Vectors

Global Vectors for Word Representation: merging the benefits of matrix factorization with local context window architectures.

GloVe (Global Vectors)

Created by Stanford researchers in 2014, GloVe (Global Vectors for Word Representation) set out to combine the best of two earlier NLP approaches. The argument: predictive models like Word2Vec work well, but they only look at a small window of local context, so they miss the global statistics of the entire corpus.

Combining Two Philosophies

GloVe bridges two different text representation paradigms:

  1. Matrix Factorization (Global): Methods like LSA factorize a giant co-occurrence matrix, capturing global statistical frequencies well but performing poorly on analogy tasks.
  2. Local Context Window (Predictive): Methods like Word2Vec learn analogy structure well but fail to exploit global corpus statistics.

GloVe trains on the non-zero entries of a global word-word co-occurrence matrix, rather than on the entire sparse matrix or just separate local windows, optimizing a log-bilinear regression model.
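Concretely, that log-bilinear objective is a weighted least-squares loss: the dot product of two word vectors (plus bias terms) should match the log of their co-occurrence count. A minimal NumPy sketch of the per-pair loss follows; the hyperparameters x_max = 100 and alpha = 0.75 are the defaults from the original paper, while the function names are mine:

```python
import numpy as np

def glove_weight(x_ij, x_max=100.0, alpha=0.75):
    # f(X_ij): down-weights rare pairs, caps the influence of very frequent ones
    return (x_ij / x_max) ** alpha if x_ij < x_max else 1.0

def pair_loss(w_i, w_tilde_j, b_i, b_j, x_ij):
    # weighted squared error between (dot product + biases)
    # and the log of the observed co-occurrence count X_ij
    return glove_weight(x_ij) * (w_i @ w_tilde_j + b_i + b_j - np.log(x_ij)) ** 2

# a perfectly fit pair has zero loss: dot + biases == log(X_ij)
w = np.zeros(300)
print(pair_loss(w, w, np.log(10.0), 0.0, 10.0))  # -> 0.0
```

Training sums this loss over all non-zero entries of the matrix, so the cost never touches the (vast) zero entries.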

The Ratio of Probabilities

The key mathematical insight of GloVe lies in modeling ratios of co-occurrence probabilities rather than raw probabilities.

Let's look at the words Ice and Steam. We want to see how they interact with probe words like Solid and Gas.

Probability / Ratio       Solid             Gas               Water (shares both)   Fashion (irrelevant)
P(k | Ice)                High (1.9×10⁻⁴)   Low (6.6×10⁻⁵)    High                  Low
P(k | Steam)              Low (2.2×10⁻⁵)    High (7.8×10⁻⁴)   High                  Low
P(k | Ice) / P(k | Steam) Large (8.9)       Small (0.085)     Neutral (~1.0)        Neutral (~1.0)

Takeaway: The ratio clearly discriminates the relevant thermodynamic properties (Solid vs Gas) while canceling out words that appear frequently with both (Water) or rarely with both (Fashion). GloVe forces the model to learn these ratios to encode meaning.
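The ratio idea can be demonstrated on a toy corpus. The sketch below builds a co-occurrence count table with a sliding window and compares probability ratios; the corpus and window size are invented for illustration, so the numbers are crude, but the direction of each ratio matches the table above:

```python
from collections import defaultdict

# tiny hand-made corpus (purely illustrative)
corpus = ("ice is solid cold ice solid water "
          "steam is gas hot steam gas water").split()

window = 2
cooc = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[w][corpus[j]] += 1  # count co-occurrences in the window

def p(k, w):
    # P(k | w): how often probe k appears near word w
    return cooc[w][k] / sum(cooc[w].values())

for probe in ["solid", "gas", "water"]:
    # ratio > 1: probe relates to ice; < 1: to steam; near 1: neutral
    print(probe, round(p(probe, "ice") / p(probe, "steam"), 2))
```

Even on eight distinct words, "solid" yields a ratio above 1 and "gas" a ratio below 1, which is exactly the signal GloVe's objective preserves.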

Loading Pre-Trained GloVe Vectors (using spaCy)
# Rather than training GloVe from scratch (which takes massive compute),
# the industry standard is to download pre-trained vectors.
# spaCy's 'en_core_web_md' and 'lg' models come bundled with GloVe-style vectors.

import spacy

# Load the medium model (20k unique 300-dimensional GloVe vectors).
# Download it first with: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

# nlp() returns a Doc; for a single word, doc.vector is that word's vector
word1 = nlp("king")
word2 = nlp("queen")
word3 = nlp("apple")

# Extract the dense 300D GloVe numpy array
print(f"Vector shape for 'king': {word1.vector.shape}")
# >>> (300,)

# spaCy automatically calculates Cosine Similarity using these GloVe vectors!
print(f"Similarity (King vs Queen): {word1.similarity(word2):.3f}")
# >>> 0.725

print(f"Similarity (King vs Apple): {word1.similarity(word3):.3f}")
# >>> 0.204
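Under the hood, similarity() is just the cosine of the angle between the two vectors. The standalone NumPy sketch below shows the formula; the 3-D vectors are made up so the example runs without downloading a spaCy model:

```python
import numpy as np

def cosine(u, v):
    # cosine similarity: dot product divided by the product of magnitudes
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy 3-D stand-ins (real GloVe vectors are 300-D)
king  = np.array([0.8, 0.6, 0.1])
queen = np.array([0.7, 0.7, 0.2])
apple = np.array([0.1, 0.2, 0.9])

print(cosine(king, queen))  # close to 1: similar direction
print(cosine(king, apple))  # much lower: dissimilar
```

Because cosine similarity ignores vector magnitude, only the direction of a GloVe vector carries the semantic signal compared here.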