ELMo Embeddings

Embeddings from Language Models: The transition from static embeddings to deep contextualized word representations using BiLSTMs.

ELMo: Contextual Embeddings

ELMo (Embeddings from Language Models), introduced in 2018 by researchers at the Allen Institute for AI and the University of Washington (Peters et al.), marked a critical turning point in NLP: the shift from static embeddings (Word2Vec, GloVe) to deep contextualized embeddings.

The Core Problem with Static Embeddings: Polysemy
In Word2Vec, the word "bank" has exactly one vector. Whether the sentence is "river bank" or "savings bank", Word2Vec returns the exact same numbers. This is a fundamental limitation, because the word's meaning is entirely context-dependent!
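A minimal sketch of why this fails (the vectors below are made-up toy numbers, not a trained model): a static embedding is just a dictionary lookup, so the sentence has no influence on the result.

```python
# Hypothetical 3-dimensional static embeddings (toy values for illustration).
static_embeddings = {
    "bank": [0.81, 0.22, -0.40],
    "river": [0.12, -0.99, 0.30],
    "money": [0.55, 0.10, 0.70],
}

def embed_static(word, sentence):
    # The sentence is ignored entirely -- that is exactly the problem.
    return static_embeddings[word]

v1 = embed_static("bank", "he deposited money in the bank".split())
v2 = embed_static("bank", "he sat by the river bank".split())
assert v1 == v2  # identical vectors despite completely different meanings
```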

How ELMo Solves This

ELMo does not use a fixed dictionary lookup. Instead, it computes the embedding for a word on the fly by reading the entire sentence the word appears in.

Contextual Output

Sentence A

"He deposited money in the bank."

Vector: [0.81, 0.22, -0.4...]

Sentence B

"He sat by the river bank."

Vector: [0.12, -0.99, 0.3...]

Different contexts = Completely different mathematical vectors for the exact same word!
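The idea can be illustrated with a deliberately tiny toy model (this is not ELMo's actual computation; the vectors and the mixing rule below are assumptions for demonstration): make the embedding a function of the word and its context, and the same word immediately gets different vectors in different sentences.

```python
# Hypothetical 2-dimensional static vectors (toy values).
static = {
    "bank": [0.5, 0.5], "river": [0.0, 1.0], "money": [1.0, 0.0],
    "he": [0.1, 0.1], "deposited": [0.9, 0.2], "sat": [0.2, 0.3],
    "in": [0.1, 0.0], "the": [0.0, 0.1], "by": [0.1, 0.2],
}

def mean(vectors):
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def embed_contextual(word, sentence):
    # Toy contextualization: blend the word's own vector with the
    # average of the other words' vectors in the sentence.
    context = mean([static[w] for w in sentence if w != word])
    return [0.5 * a + 0.5 * b for a, b in zip(static[word], context)]

a = embed_contextual("bank", ["he", "deposited", "money", "in", "the", "bank"])
b = embed_contextual("bank", ["he", "sat", "by", "the", "river", "bank"])
assert a != b  # same word, different sentences -> different vectors
```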

The Architecture: Bi-Directional LSTM

ELMo uses a deep, 2-layer bi-directional LSTM (Long Short-Term Memory) network trained on a language modeling objective: the forward LSTM predicts the next word, and the backward LSTM predicts the previous word.

  • Forward Pass: Reads the sentence from left-to-right to understand past context.
  • Backward Pass: Reads the sentence from right-to-left to understand future context.
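The two passes can be sketched with a toy recurrence (a single scalar blend standing in for a real LSTM cell; the inputs and the 0.5/0.5 update rule are assumptions for illustration). Each token ends up with one state that summarizes its past and one that summarizes its future.

```python
def run_rnn(inputs):
    # Toy recurrence standing in for an LSTM cell: each hidden state
    # is a running blend of the previous state and the current input.
    h, states = 0.0, []
    for x in inputs:
        h = 0.5 * h + 0.5 * x
        states.append(h)
    return states

xs = [1.0, 2.0, 3.0, 4.0]           # one number per token, for simplicity
forward = run_rnn(xs)                # left-to-right: past context
backward = run_rnn(xs[::-1])[::-1]   # right-to-left: future context

# Each token's bidirectional representation pairs both directions.
bi_states = list(zip(forward, backward))
```

In the real model the per-direction states are high-dimensional LSTM hidden vectors that get concatenated, but the control flow is the same: one left-to-right sweep, one right-to-left sweep.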

The final ELMo embedding for a token is a task-specific weighted sum of the layer representations (the character-based input layer plus the two biLSTM layers), with softmax-normalized weights learned on the downstream task. ELMo directly paved the way for BERT and the Transformer-based contextual models that followed.
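The combination step can be written out concretely. In the paper's notation, the embedding for token k is ELMo_k = γ · Σ_j s_j · h_{k,j}, where the s_j are softmax-normalized weights over the layers and γ is a learned scale. The layer states, weights, and γ below are assumed toy values:

```python
import math

# Hypothetical layer representations for one token: 3 layers x 2 dims
# (in real ELMo these are the input layer plus two biLSTM layers).
layers = [[0.2, 0.4], [0.6, 0.1], [0.3, 0.9]]
s = [0.1, 0.5, -0.2]   # learned scalar layer weights (assumed values)
gamma = 1.0            # learned task-specific scale (assumed value)

# Softmax-normalize the layer weights so they sum to 1.
exp_s = [math.exp(v) for v in s]
weights = [v / sum(exp_s) for v in exp_s]

# Weighted sum across layers, dimension by dimension.
elmo = [gamma * sum(w * layer[d] for w, layer in zip(weights, layers))
        for d in range(len(layers[0]))]
```

Because the weights are learned per downstream task, a syntax-heavy task can lean on lower layers while a semantics-heavy task can lean on higher ones.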