Word2Vec – short Q&A
20 questions and answers on Word2Vec, explaining CBOW and skip-gram architectures, negative sampling and key training considerations.
What is Word2Vec?
Answer: Word2Vec is a family of neural language models that learn word embeddings by predicting words from their context (CBOW) or predicting context words from a target word (skip-gram).
What is the Continuous Bag-of-Words (CBOW) model?
Answer: CBOW predicts a target word given its surrounding context words by averaging or summing their embeddings and using the result to classify which word should appear in the center.
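For illustration, a minimal NumPy sketch of a single CBOW prediction step; the vocabulary size, dimensions, weights, and context indices below are toy values, not part of a full implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                       # toy vocabulary size and embedding dimension
W_in = rng.normal(size=(V, d))     # input (context) embedding matrix
W_out = rng.normal(size=(V, d))    # output (target) embedding matrix

context_ids = [2, 5, 7, 9]                 # words surrounding the target position
h = W_in[context_ids].mean(axis=0)         # CBOW: average the context vectors
scores = W_out @ h                         # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                       # softmax over the vocabulary
predicted = int(np.argmax(probs))          # the model's guess for the center word
```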
What is the skip-gram model in Word2Vec?
Answer: Skip-gram takes a single target word as input and tries to predict its surrounding context words within a window, learning embeddings that are good at generating contextual neighbors.
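The (target, context) training pairs that skip-gram learns from can be generated with a simple sliding window. This is a sketch only; real implementations also shuffle pairs and apply subsampling:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every position within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat", "on", "the", "mat"], window=1)
# e.g. ("cat", "the"), ("cat", "sat"), ...
```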
How does negative sampling speed up training?
Answer: Negative sampling replaces full softmax over the vocabulary with a small number of sampled “negative” words, so each update only adjusts a few output weights instead of all vocabulary entries.
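A sketch of a single skip-gram update with negative sampling, using toy sizes and uniform negative sampling (real implementations draw negatives from the unigram distribution raised to the 0.75 power). Note that only one input row and at most k+1 output rows are touched:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
V, d, k, lr = 50, 8, 5, 0.05               # toy vocab, dim, negatives, learning rate
W_in = rng.normal(scale=0.1, size=(V, d))
W_out = np.zeros((V, d))

target, context = 3, 17                    # illustrative word indices
negatives = rng.integers(0, V, size=k)     # uniform here for simplicity

v = W_in[target]
# positive pair: push its score toward 1
g = sigmoid(W_out[context] @ v) - 1.0
grad_v = g * W_out[context]
W_out[context] -= lr * g * v
# negative pairs: push their scores toward 0
for n in negatives:
    g = sigmoid(W_out[n] @ v)
    grad_v += g * W_out[n]
    W_out[n] -= lr * g * v
W_in[target] = v - lr * grad_v             # single input-row update
```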
What is hierarchical softmax in Word2Vec?
Answer: Hierarchical softmax uses a binary tree over the vocabulary; predicting a word becomes a sequence of binary decisions along the tree path, reducing complexity from O(V) to O(log V) per update.
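The savings are easy to see numerically. A balanced binary tree is assumed here; word2vec actually uses a Huffman tree, which makes paths for frequent words even shorter:

```python
import math

V = 100_000                              # vocabulary size
full_softmax_updates = V                 # output weights touched per prediction
hs_updates = math.ceil(math.log2(V))     # binary decisions along the tree path
# 100,000 per-word updates shrink to roughly 17 binary decisions
```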
Which model, CBOW or skip-gram, tends to work better on rare words?
Answer: Skip-gram generally performs better on rare words because it treats each target word separately and learns from multiple context positions, giving more training signal to infrequent tokens.
What is the role of the context window size in Word2Vec?
Answer: The window size controls how many surrounding words are considered context; small windows focus on syntactic relations, while larger windows capture broader semantic or topical associations.
Why is subsampling of frequent words used in Word2Vec?
Answer: Frequent words like stopwords can dominate training and slow convergence; subsampling randomly discards some occurrences, speeding up learning and improving embedding quality for less frequent terms.
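The original paper's rule discards a word with probability 1 − sqrt(t/f), where f is the word's corpus frequency and t is the subsampling threshold. This sketch follows the paper's formula (the word2vec C code uses a slightly different variant):

```python
import math

def keep_probability(word_freq, t=1e-3):
    """P(keep) = sqrt(t / f), capped at 1 for rare words."""
    return min(1.0, math.sqrt(t / word_freq))

# a stopword making up 5% of tokens is kept only ~14% of the time,
# while a rare word at 0.01% frequency is always kept
keep_probability(0.05), keep_probability(1e-4)
```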
What loss function is typically optimized in skip-gram with negative sampling?
Answer: Skip-gram with negative sampling optimizes a logistic regression objective where the model maximizes the probability of true (target, context) pairs and minimizes it for sampled negative pairs.
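That objective can be written directly in a few lines of NumPy. The vectors here are hand-picked toy values; in training, the expectation over negatives is approximated by sampling:

```python
import numpy as np

def sgns_loss(v_target, v_context, v_negatives):
    """Negative log-likelihood of the SGNS logistic objective."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = np.log(sig(v_context @ v_target))            # true pair scored high
    neg = np.sum(np.log(sig(-(v_negatives @ v_target))))  # negatives scored low
    return -(pos + neg)

v_t = np.array([1.0, 0.0])          # toy target vector
v_c = np.array([1.0, 0.0])          # toy context vector (aligned with target)
v_negs = np.array([[0.0, 1.0]])     # one toy negative (orthogonal)
loss = sgns_loss(v_t, v_c, v_negs)
```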
How does Word2Vec relate to matrix factorization?
Answer: It has been shown that the skip-gram with negative sampling objective implicitly factorizes a shifted pointwise mutual information (PMI) matrix, connecting Word2Vec to distributional count models.
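A toy computation of that shifted PMI matrix from a small symmetric co-occurrence table; the counts are illustrative, and k denotes the number of negative samples:

```python
import numpy as np

# toy co-occurrence counts between 3 words (symmetric by construction)
C = np.array([[0., 4., 1.],
              [4., 0., 2.],
              [1., 2., 0.]])
total = C.sum()
p_wc = C / total                            # joint probabilities
p_w = p_wc.sum(axis=1, keepdims=True)       # word marginals
p_c = p_wc.sum(axis=0, keepdims=True)       # context marginals
with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))        # zero counts give -inf cells
k = 5
shifted_pmi = pmi - np.log(k)               # the matrix SGNS implicitly factorizes
```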
What happens to Word2Vec embeddings if the training corpus is very small?
Answer: With limited text, the model may not see enough diverse contexts to learn reliable embeddings, leading to noisy or unstable vectors, especially for rare words or phrases.
Can Word2Vec be trained on subword units?
Answer: Vanilla Word2Vec works at the word level, but variants like fastText extend the idea to character n-grams, effectively combining Word2Vec-style training with subword modeling.
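fastText's subword decomposition is easy to sketch: each word is wrapped in boundary markers and split into character n-grams. This reproduces the fastText paper's "where" example for n = 3:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with fastText's boundary markers."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

char_ngrams("where", 3, 3)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

The word's vector is then the sum of its n-gram vectors, so unseen words still get reasonable embeddings.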
How do we use trained Word2Vec embeddings in other models?
Answer: The learned embedding matrix can be exported and used to initialize embedding layers in downstream neural networks or to compute average/summed document vectors for simpler classifiers.
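A minimal sketch of the averaged-document-vector approach, assuming a toy embedding lookup; out-of-vocabulary tokens are simply skipped:

```python
import numpy as np

# toy "pretrained" embeddings standing in for an exported Word2Vec matrix
embeddings = {"good": np.array([0.2, 0.8]),
              "movie": np.array([0.5, 0.1])}

def doc_vector(tokens, embeddings, dim=2):
    """Average the embeddings of in-vocabulary tokens."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

doc = doc_vector(["good", "movie", "unknownword"], embeddings)
```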
What are some practical hyperparameters in Word2Vec training?
Answer: Key hyperparameters include vector dimension, window size, number of negative samples, subsampling threshold, learning rate and minimum word frequency for inclusion in the vocabulary.
What is the effect of increasing the embedding dimension in Word2Vec?
Answer: Higher dimensions can capture more subtle relationships but require more data and computation; too high without enough data can lead to overfitting and noisy vectors.
Why is Word2Vec considered a shallow model?
Answer: Word2Vec uses a single hidden layer with linear transformations and a simple objective; there is no deep stack of nonlinear layers like in modern transformer architectures.
How does Word2Vec compare to more recent contextual models?
Answer: Word2Vec produces static type-level embeddings, while contextual models like BERT generate context-dependent embeddings; contextual models typically outperform Word2Vec on complex NLP tasks.
Can Word2Vec be fine-tuned on a specific domain?
Answer: Yes, pre-trained Word2Vec embeddings can be further trained on in-domain text to adapt them to domain-specific vocabulary and semantics, though care is needed to avoid overfitting.
What tools or libraries commonly support Word2Vec training?
Answer: Libraries like Gensim, spaCy and some deep learning frameworks provide implementations or wrappers for training Word2Vec and working with pre-trained word2vec-format embeddings.
Why is Word2Vec still relevant despite newer models?
Answer: Word2Vec embeddings are lightweight, easy to train and interpret, and they offer a strong starting point for many applications, especially when compute or data is limited relative to what large transformer models require.
🔍 Word2Vec concepts covered
This page covers Word2Vec: CBOW and skip-gram architectures, negative sampling, subsampling, hyperparameters and how Word2Vec embeddings connect to modern NLP practice.