Word2Vec

Deep dive into Google's Word2Vec architecture (CBOW and Skip-Gram) and how neural networks learn semantics.

Word2Vec Formulations

Introduced by Tomas Mikolov and colleagues at Google in 2013, Word2Vec is not a single algorithm but a shallow, two-layer neural network framework with two distinct architectures: CBOW and Skip-Gram. The secret? It learns embeddings as a byproduct of a "fake" prediction task: the predictions themselves are thrown away, and the trained hidden-layer weights become the word vectors.

1. Continuous Bag of Words (CBOW)

Predicts the Target word from the Context.

Given the context: "The fox ___ over the dog"

Neural Net Predicts: "jumps"

CBOW is several times faster to train than Skip-Gram and has slightly better accuracy for frequent words.
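To make the CBOW setup concrete, here is a minimal pure-Python sketch (the sentence, window size, and function name `cbow_pairs` are illustrative, not part of any library) that generates the (context, target) training pairs CBOW learns from:

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for CBOW-style training."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

sentence = "the fox jumps over the dog".split()
for context, target in cbow_pairs(sentence):
    print(context, "->", target)
# e.g. ['the', 'fox', 'over', 'the'] -> jumps
```

Each position yields exactly one training example, with the context words averaged into a single input; that single update per position is a large part of why CBOW trains faster.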

2. Skip-Gram

Predicts Context words from a given Target word.

Given the target: "___ ___ jumps ___ ___"

Neural Net Predicts: "fox", "over", "dog"

Skip-Gram is slower, but it works extremely well with small amounts of training data and handles rare words exceptionally well.
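The Skip-Gram objective inverts the pairing. A minimal sketch (again, `skipgram_pairs` and the sentence are illustrative) of how one sentence expands into (target, context) training pairs:

```python
def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs for Skip-Gram-style training."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:  # every context word becomes its own training pair
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the fox jumps over the dog".split()
print([c for t, c in skipgram_pairs(sentence) if t == "jumps"])
# ['the', 'fox', 'over', 'the']
```

Unlike CBOW, every context word produces its own training pair, so Skip-Gram performs many more updates per sentence. Rare words therefore receive more individual gradient updates, which is one intuition for why they end up better represented.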

The Magic of Vector Math

Word2Vec famously demonstrated that its continuous vector space captures relational analogies through simple vector addition and subtraction:

King - Man + Woman ≈ Queen

Paris - France + Italy ≈ Rome
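An analogy query boils down to vector arithmetic followed by a nearest-neighbour search under cosine similarity. A minimal self-contained sketch, using hand-crafted 2-D toy vectors (the vectors and the `analogy` helper are invented for illustration; real embeddings have hundreds of learned dimensions):

```python
from math import sqrt

# Hand-crafted toy vectors: axis 0 ~ gender, axis 1 ~ royalty.
# Purely illustrative; real Word2Vec dimensions are learned, not labelled.
vectors = {
    "man":   [1.0, 0.2],
    "woman": [-1.0, 0.2],
    "king":  [1.0, 1.0],
    "queen": [-1.0, 1.0],
    "apple": [0.0, -1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via b - a + c, excluding the query words."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))

print(analogy("man", "king", "woman"))  # queen
```

Gensim performs the same computation for real embeddings via `model.wv.most_similar(positive=[...], negative=[...])`.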
Training Word2Vec using Python Gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk

# word_tokenize needs the Punkt tokenizer data (downloaded once)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by NLTK >= 3.9

# Sample corpus (should be massive in reality!)
corpus = [
    "Machine learning is the study of computer algorithms.",
    "Deep learning is a subset of machine learning using neural networks.",
    "Artificial intelligence creates smart machines."
]

# Tokenize sentences into lists of words
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

# 1. Train the Word2Vec model
# vector_size: embedding dimension (usually 100-300; 10 here because the corpus is tiny)
# window: maximum distance between the target and a context word
# min_count: ignore words appearing fewer than this many times (1 for a toy corpus)
# sg: 1 for Skip-Gram, 0 for CBOW
model = Word2Vec(sentences=tokenized_corpus, vector_size=10, window=3, min_count=1, sg=1)

# 2. Extract the learned vector for a word
print("Vector for 'learning':")
print(model.wv['learning'])

# 3. Find the most semantically similar words
print("\nMost similar to 'learning':")
print(model.wv.most_similar('learning', topn=3))