Word2Vec
Deep dive into Google's Word2Vec architecture (CBOW and Skip-Gram) and how neural networks learn semantics.
Word2Vec Formulations
Introduced by Tomas Mikolov and colleagues at Google in 2013, Word2Vec is not a single algorithm but a shallow, two-layer neural network framework with two distinct architectural flavors: CBOW and Skip-Gram. The secret? It learns embeddings as a byproduct of a "fake" prediction task: the trained network's weights, not its predictions, are the product.
1. Continuous Bag of Words (CBOW)
Predicts the Target word from the Context.
"The fox ___ over the dog" → Neural net predicts: "jumps"
CBOW is several times faster to train than Skip-Gram and has slightly better accuracy for frequent words.
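The core of CBOW is simple to sketch: average the context words' embeddings into one hidden vector, then score every vocabulary word from it. The toy NumPy forward pass below is an illustration of that idea, not gensim's actual implementation (the vocabulary, dimensions, and random weights are made up):

```python
import numpy as np

# Toy CBOW forward pass (illustrative sketch, not gensim's implementation).
vocab = ["the", "fox", "jumps", "over", "dog"]
word2idx = {w: i for i, w in enumerate(vocab)}

V, D = len(vocab), 4             # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, D))   # input embeddings (one row per word)
W_out = rng.normal(size=(D, V))  # output (prediction) weights

# Context ("the", "fox", "over", "the") should predict the target "jumps".
context = ["the", "fox", "over", "the"]
# CBOW averages the context word embeddings into a single hidden vector...
h = W_in[[word2idx[w] for w in context]].mean(axis=0)
# ...then scores every vocabulary word and softmaxes into probabilities.
scores = h @ W_out
probs = np.exp(scores) / np.exp(scores).sum()
print(probs.shape)  # one probability per vocabulary word
```

Training nudges `W_in` and `W_out` so the true target word gets high probability; after training, the rows of `W_in` are the word vectors.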
2. Skip-Gram
Predicts Context words from a given Target word.
"___ ___ jumps ___ ___" → Neural net predicts: "fox", "over", "dog"
Skip-Gram is slower, but it works extremely well with small amounts of training data and handles rare words exceptionally well.
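Skip-Gram's training data is just (target, context) pairs: each word predicts every neighbor within the window. A minimal sketch of how those pairs are generated (the sentence and window size are arbitrary examples):

```python
# Generating Skip-Gram (target, context) training pairs - a minimal sketch.
sentence = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
window = 2  # how many neighbors on each side count as context

pairs = []
for i, target in enumerate(sentence):
    # Each target word predicts every word within `window` positions of it.
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```

Because every occurrence of a word generates multiple training pairs, even rare words get several updates per appearance, which is one intuition for why Skip-Gram handles rare words well.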
The Magic of Vector Math
Word2Vec famously showed that its continuous vector space captures relational analogies through simple vector addition and subtraction.
king − man + woman ≈ Queen
Paris − France + Italy ≈ Rome
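Analogy solving is just nearest-neighbor search on the result of the arithmetic. The toy example below uses hand-crafted 3-dimensional vectors to make the mechanics visible; real embeddings are learned, typically 100–300 dimensions:

```python
import numpy as np

# Toy illustration of analogy arithmetic. These vectors are hand-crafted
# for demonstration, not learned embeddings.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

# king - man + woman should land near queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Find the closest word to the result (excluding the query word itself).
best = max((w for w in vecs if w != "king"), key=lambda w: cosine(vecs[w], target))
print(best)  # queen
```

With a trained gensim model, the same query is `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`, which performs this arithmetic and nearest-neighbor search internally.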
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# NLTK's tokenizer needs its 'punkt' data on first use:
# import nltk; nltk.download('punkt')

# Sample corpus (should be massive in reality!)
corpus = [
"Machine learning is the study of computer algorithms.",
"Deep learning is a subset of machine learning using neural networks.",
"Artificial intelligence creates smart machines."
]
# Tokenize sentences into lists of words
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]
# 1. Train Word2Vec Model
# vector_size: dimension of embedding (usually 100-300)
# window: context window size
# sg: 1 for Skip-Gram, 0 for CBOW
model = Word2Vec(sentences=tokenized_corpus, vector_size=10, window=3, min_count=1, sg=1)
# 2. Extract the learned vector for a word
print("Vector for 'learning':")
print(model.wv['learning'])
# 3. Find most similar semantic words!
print("\nMost similar to 'learning':")
print(model.wv.most_similar('learning', topn=3))