Co-occurrence Matrix
Understand how counting the words that appear together within a context window creates a rich statistical model of meaning.
Bag-of-Words and TF-IDF create Document-Term matrices (Documents as rows, Words as columns). In contrast, a Co-occurrence Matrix creates a Word-Word matrix (Words as rows, Words as columns). It captures how often two different words appear together within a specific "window" distance in a sentence.
This follows the distributional hypothesis, famously summarized by linguist J.R. Firth: "You shall know a word by the company it keeps." Words that appear in similar contexts tend to share semantic meaning.
How it works: The Context Window
Assume a corpus with one sentence: "deep learning is incredibly exciting"
If we set our Window Size = 1 (look 1 word left, 1 word right), we scan the text:
- Focus on "learning": Left is "deep", Right is "is".
- Add +1 to coordinates (learning, deep) and (learning, is) in the matrix.
| | deep | learning | is | incredibly | exciting |
|---|---|---|---|---|---|
| deep | 0 | 1 | 0 | 0 | 0 |
| learning | 1 | 0 | 1 | 0 | 0 |
| is | 0 | 1 | 0 | 1 | 0 |
| incredibly | 0 | 0 | 1 | 0 | 1 |
| exciting | 0 | 0 | 0 | 1 | 0 |
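The windowed counting described above can be sketched in a few lines of NumPy. This is a minimal illustration (function and variable names are my own), reproducing the table for the example sentence:

```python
import numpy as np

def cooccurrence_matrix(tokens, window=1):
    """Build a word-word co-occurrence matrix with a symmetric context window."""
    vocab = list(dict.fromkeys(tokens))       # vocabulary in order of first appearance
    index = {word: i for i, word in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)), dtype=int)
    for i, word in enumerate(tokens):
        # Look `window` positions to the left and right of the focus word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                M[index[word], index[tokens[j]]] += 1
    return vocab, M

tokens = "deep learning is incredibly exciting".split()
vocab, M = cooccurrence_matrix(tokens, window=1)
```

Because every neighbor pair is counted from both directions, the resulting matrix is symmetric, matching the table above.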
Advantages
- Preserves semantic relationships between words (unlike BoW, which ignores context entirely).
- The row vectors have geometric meaning: words that share contexts (e.g. synonyms) end up close together in vector space.
- Forms the mathematical backbone of GloVe embeddings and Latent Semantic Analysis (LSA).
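The geometric claim can be checked with cosine similarity between row vectors. Below is a sketch on a tiny made-up corpus (chosen so that "cats" and "dogs" occur in identical contexts); the helper functions are my own illustration:

```python
import numpy as np

def cooccurrence(tokens, window=1):
    vocab = list(dict.fromkeys(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                M[index[w], index[tokens[j]]] += 1
    return vocab, M

def cosine(u, v):
    # Cosine similarity: 1.0 = same direction, 0.0 = orthogonal.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "cats" and "dogs" appear in the same contexts, so their rows are similar.
tokens = "cats are cute dogs are cute cats are fun dogs are fun".split()
vocab, M = cooccurrence(tokens)

cats, dogs, cute = (M[vocab.index(w)] for w in ("cats", "dogs", "cute"))
print(cosine(cats, dogs))  # higher than cosine(cats, cute)
```

On this corpus the "cats" and "dogs" rows are more similar to each other than either is to "cute", which is exactly the clustering effect the distributional hypothesis predicts.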
Disadvantages
- Memory Intensive: The matrix is Vocab × Vocab. With V = 100,000 you would need 10 billion entries, so sparse matrices are usually required (most entries are zero).
- Typically needs dimensionality reduction, commonly via truncated Singular Value Decomposition (SVD), before the vectors are practical for downstream modeling.
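The SVD step can be sketched on the small example matrix from earlier. This uses NumPy's dense SVD for clarity; for a real vocabulary you would use a sparse representation (e.g. `scipy.sparse`) with a truncated solver instead:

```python
import numpy as np

# The 5x5 window-1 co-occurrence matrix from the example above.
M = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# Full SVD factorizes M into U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(M)

# Keeping only the top-k singular values/vectors yields dense,
# low-dimensional word vectors -- the core idea behind LSA.
k = 2
word_vectors = U[:, :k] * S[:k]   # shape: (vocab_size, k)
```

Each row of `word_vectors` is now a dense 2-dimensional embedding for one vocabulary word, instead of a sparse 5-dimensional (or, in practice, 100,000-dimensional) count vector.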