Embeddings Q&A

Word embeddings – short Q&A

20 questions and answers on dense vector representations of words, distributional semantics and how embeddings support modern NLP models.

1

What are word embeddings in NLP?

Answer: Word embeddings are dense, low-dimensional numeric vectors that represent words such that similar words have similar vector representations based on their distributional context.

2

How do embeddings differ from one-hot vectors?

Answer: One-hot vectors are high-dimensional and sparse, with no notion of similarity between words, while embeddings are dense and encode semantic relations through geometric proximity in vector space.
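A minimal Python sketch of this contrast, using a toy three-word vocabulary and hand-picked dense vectors (the numbers are illustrative, not learned):

```python
# Contrast a sparse one-hot vector with a dense embedding (toy numbers).
vocab = ["cat", "dog", "car"]

def one_hot(word):
    # |V|-dimensional, exactly one 1; all distinct words are equidistant.
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical 3-d dense vectors: "cat" and "dog" are deliberately close.
dense = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# One-hot: similarity between any two distinct words is always 0.
print(dot(one_hot("cat"), one_hot("dog")))  # 0.0
# Dense: related words score higher than unrelated ones.
print(dot(dense["cat"], dense["dog"]) > dot(dense["cat"], dense["car"]))  # True
```

The geometric point is that one-hot vectors carry no similarity signal at all, while even these tiny dense vectors place "cat" nearer "dog" than "car".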

3

What is the distributional hypothesis behind embeddings?

Answer: The distributional hypothesis states that words that occur in similar contexts tend to have similar meanings; embedding models operationalize this by learning vectors that predict or are predicted by neighboring words.
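One way skip-gram operationalizes this is by turning raw text into (center, context) training pairs from a sliding window; a short sketch (window size 2, whitespace tokenization for simplicity):

```python
# Extract (center, context) training pairs as in skip-gram (window size 2).
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence)
print(pairs[:3])  # [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]
```

Each pair becomes one prediction task ("given the center word, predict a neighbor"), so words appearing in similar windows end up with similar vectors.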

4

How are embeddings typically trained?

Answer: Classic models like word2vec and GloVe learn embeddings from large unlabeled corpora by optimizing objectives such as predicting context words or factorizing co-occurrence statistics.
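The co-occurrence statistics that GloVe-style models factorize can be sketched as a simple windowed count over a corpus (toy corpus, window size 2):

```python
# Count word co-occurrences within a window: the raw statistics that
# count-based models such as GloVe build on.
from collections import Counter

def cooccurrence(tokens, window=2):
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

corpus = "the cat sat on the mat the cat slept".split()
counts = cooccurrence(corpus)
print(counts[("the", "cat")])  # 2
```

Real training then fits vectors so that their dot products reproduce (log) co-occurrence patterns; this sketch only shows the counting step, not the optimization.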

5

What is cosine similarity, and why is it used with embeddings?

Answer: Cosine similarity measures the angle between two vectors; it is commonly used to compare embeddings because it reflects direction (semantic similarity) regardless of vector magnitude.
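A self-contained implementation makes the magnitude-invariance concrete:

```python
import math

def cosine(a, b):
    # Dot product divided by the product of magnitudes: depends only on
    # the angle between the vectors, not on their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]            # same direction, twice the magnitude
print(cosine(u, v))            # ~1.0 (scaling does not change it)
print(cosine([1, 0], [0, 1]))  # 0.0 (orthogonal vectors)
```

This is why cosine, rather than Euclidean distance, is the default comparison for embeddings: frequent words tend to have larger-magnitude vectors, and cosine factors that out.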

6

What is vector arithmetic in the context of embeddings?

Answer: Vector arithmetic refers to operations like embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen"), indicating that embeddings can capture analogical relations as linear directions.
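A toy demonstration with hypothetical 2-d vectors whose axes encode (royalty, maleness); real embeddings behave similarly, just in hundreds of dimensions:

```python
# king - man + woman should land nearest "queen" (hand-crafted toy vectors).
emb = {
    "king":  [1.0, 1.0],   # royal, male
    "queen": [1.0, 0.0],   # royal, female
    "man":   [0.0, 1.0],   # male
    "woman": [0.0, 0.0],   # female
    "apple": [0.2, 0.5],   # unrelated distractor
}

target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Nearest neighbor, excluding the three query words (standard practice).
best = min((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: sqdist(emb[w], target))
print(best)  # queen
```

In practice the nearest neighbor is found with cosine similarity over the full vocabulary, but the linear-offset idea is exactly this.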

7

What is the difference between static and contextual embeddings?

Answer: Static embeddings assign one vector per word type, while contextual embeddings (from models like BERT) produce different vectors for the same token depending on its surrounding context.

8

How can we use pre-trained embeddings in downstream tasks?

Answer: Pre-trained embeddings can initialize the embedding layer of neural networks or be used as frozen features, providing rich semantic information even with limited labeled data.

9

What is OOV (out-of-vocabulary) in the context of embeddings?

Answer: OOV words are tokens that were not seen in the training corpus of the embedding model, so they lack vectors; strategies include using an <UNK> embedding or subword-based models like fastText.
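The &lt;UNK&gt;-fallback strategy is just a defaulted dictionary lookup; a sketch with a hypothetical placeholder vector:

```python
# Look up a word vector, falling back to a shared <UNK> embedding for OOV tokens.
UNK = "<UNK>"
vectors = {
    UNK:   [0.0, 0.0, 0.0],   # hypothetical placeholder vector
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
}

def lookup(word):
    return vectors.get(word, vectors[UNK])

print(lookup("cat"))      # known word: its own vector
print(lookup("axolotl"))  # OOV word: the shared <UNK> vector
```

The subword alternative (see the fastText question below) avoids collapsing all unknown words onto one vector by composing them from character n-grams.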

10

How do embeddings handle polysemy (multiple senses)?

Answer: Static embeddings conflate senses into a single vector, while contextual embeddings can distinguish senses by producing different vectors for each usage, better capturing meaning in context.

11

What are some common evaluation tasks for embeddings?

Answer: Embeddings are often evaluated on word similarity benchmarks, analogy tasks or as features in downstream tasks like sentiment classification to see if they improve performance.

12

How do biases appear in word embeddings?

Answer: Because embeddings learn from real-world text, they can encode societal biases (e.g. gender or racial stereotypes), which may propagate or amplify unfair associations in NLP systems.

13

What techniques exist to mitigate bias in embeddings?

Answer: Methods include debiasing projections, equalizing pairs, removing specific bias directions and training on curated corpora, although fully eliminating bias remains challenging.

14

What is the role of dimensionality in embedding vectors?

Answer: Higher dimensions can capture more nuanced patterns but increase model size and risk overfitting; typical static embeddings range from 50 to 300 dimensions, while transformer hidden states are often larger.

15

How are embeddings visualized for interpretation?

Answer: Techniques such as PCA or t-SNE project high-dimensional embeddings into 2D or 3D space, allowing us to see clusters and relationships among words in plots.
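A minimal PCA projection, assuming NumPy is available and using random vectors in place of real embeddings:

```python
import numpy as np

# Project hypothetical 5-d "embeddings" down to 2-D with PCA:
# center the data, then keep the two directions of largest variance via SVD.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))      # 10 word vectors, 5 dimensions each

Xc = X - X.mean(axis=0)           # center each dimension
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T            # 2-D coordinates, ready for a scatter plot

print(coords.shape)  # (10, 2)
```

PCA is linear and deterministic; t-SNE (e.g. scikit-learn's `TSNE`) often separates clusters more clearly but distorts global distances, so it is better for inspection than for measurement.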

16

Why do we often fine-tune embeddings on specific tasks?

Answer: Fine-tuning lets embeddings adapt from general corpus patterns to task-specific nuances, improving performance on the target dataset while starting from a strong pre-trained initialization.

17

How do subword embeddings (e.g. fastText) differ from word-level embeddings?

Answer: Subword models represent words as compositions of character n-gram embeddings, allowing them to generate vectors for unseen words and better handle morphology and misspellings.
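The n-gram decomposition with boundary markers can be sketched in a few lines (n = 3 only here; fastText uses n = 3 to 6 plus the whole word):

```python
# Character n-grams with boundary markers "<" and ">", fastText-style.
def char_ngrams(word, n=3):
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

A word's vector is then the sum of its n-gram vectors, so an unseen word like "whereish" still gets a representation built from n-grams shared with known words.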

18

How are embeddings used in sequence models like RNNs or transformers?

Answer: Embeddings form the input layer: each token index is mapped to its embedding vector, which is then fed into RNNs, CNNs or transformer blocks for contextual processing.
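The input layer amounts to indexing rows of a matrix; a framework-free sketch with a hypothetical four-word vocabulary:

```python
# The embedding layer as row lookup: token ids index an |V| x d matrix.
embedding_matrix = [
    [0.1, 0.2, 0.3],   # id 0: "the"
    [0.4, 0.5, 0.6],   # id 1: "cat"
    [0.7, 0.8, 0.9],   # id 2: "sat"
    [0.0, 0.0, 0.0],   # id 3: <PAD>
]

token_ids = [0, 1, 2]  # "the cat sat" after vocabulary mapping
inputs = [embedding_matrix[i] for i in token_ids]

print(inputs[1])  # [0.4, 0.5, 0.6], the vector for "cat"
```

In frameworks like PyTorch or TensorFlow this lookup is a differentiable layer (e.g. `nn.Embedding`), which is what allows the matrix to be trained end-to-end, as the next question discusses.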

19

Can embeddings be learned jointly with other model parameters?

Answer: Yes, in most neural networks the embedding matrix is trained end-to-end with the rest of the model, allowing task-specific gradients to refine the representations.

20

Why are contextual embeddings preferred in many modern NLP applications?

Answer: Contextual embeddings capture the meaning of a word in its specific context, handling polysemy and subtle usage differences much better than single static vectors per word type.

๐Ÿ” Embeddings concepts covered

This page covers word embeddings: distributional semantics, similarity measures, vector arithmetic, bias concerns and how embeddings power modern NLP architectures.

Dense vector spaces
Similarity & cosine
Static vs contextual
Analogies & arithmetic
Pre-trained vectors
Bias & visualization