Co-occurrence Matrix
Understand how counting the words that appear together within a context window creates a rich statistical model of meaning.
Bag-of-Words and TF-IDF create Document-Term matrices (Documents as rows, Words as columns). In contrast, a Co-occurrence Matrix creates a Word-Word matrix (Words as rows, Words as columns). It captures how often two different words appear together within a specific "window" distance in a sentence.
This follows the distributional hypothesis, famously summarized by linguist J.R. Firth: "You shall know a word by the company it keeps." Words that appear in similar contexts tend to share semantic meaning.
How it works: The Context Window
Assume a corpus with one sentence: "deep learning is incredibly exciting"
If we set our Window Size = 1 (look 1 word left, 1 word right), we scan the text:
- Focus on "learning": Left is "deep", Right is "is".
- Add +1 to coordinates (learning, deep) and (learning, is) in the matrix.
| | deep | learning | is | incredibly | exciting |
|---|---|---|---|---|---|
| deep | 0 | 1 | 0 | 0 | 0 |
| learning | 1 | 0 | 1 | 0 | 0 |
| is | 0 | 1 | 0 | 1 | 0 |
| incredibly | 0 | 0 | 1 | 0 | 1 |
| exciting | 0 | 0 | 0 | 1 | 0 |
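The windowed counting described above can be sketched in a few lines of NumPy. This is a minimal illustration (function and variable names are my own), reproducing the table for the example sentence:

```python
import numpy as np

def cooccurrence_matrix(tokens, window=1):
    """Build a word-word co-occurrence matrix with a symmetric context window."""
    vocab = list(dict.fromkeys(tokens))       # vocabulary in order of first appearance
    index = {word: i for i, word in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)), dtype=int)
    for i, word in enumerate(tokens):
        # Look `window` positions to the left and right of the focus word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                M[index[word], index[tokens[j]]] += 1
    return vocab, M

tokens = "deep learning is incredibly exciting".split()
vocab, M = cooccurrence_matrix(tokens, window=1)
```

Because every neighbor pair is counted from both directions, the resulting matrix is symmetric, matching the table above.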
Advantages
- Preserves semantic relationships between words (unlike BoW, which ignores context entirely).
- The row vectors have geometric meaning: words that share contexts (e.g. synonyms) end up close together in vector space.
- Forms the mathematical backbone of GloVe embeddings and Latent Semantic Analysis (LSA).
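The geometric claim can be checked with cosine similarity between row vectors. Below is a sketch on a tiny made-up corpus (chosen so that "cats" and "dogs" occur in identical contexts); the helper functions are my own illustration:

```python
import numpy as np

def cooccurrence(tokens, window=1):
    vocab = list(dict.fromkeys(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                M[index[w], index[tokens[j]]] += 1
    return vocab, M

def cosine(u, v):
    # Cosine similarity: 1.0 = same direction, 0.0 = orthogonal.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "cats" and "dogs" appear in the same contexts, so their rows are similar.
tokens = "cats are cute dogs are cute cats are fun dogs are fun".split()
vocab, M = cooccurrence(tokens)

cats, dogs, cute = (M[vocab.index(w)] for w in ("cats", "dogs", "cute"))
print(cosine(cats, dogs))  # higher than cosine(cats, cute)
```

On this corpus the "cats" and "dogs" rows are more similar to each other than either is to "cute", which is exactly the clustering effect the distributional hypothesis predicts.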
Disadvantages
- Memory Intensive: The matrix is Vocab × Vocab. With V = 100,000 you would need 10 billion entries, so sparse matrices are usually required (most entries are zero).
- Typically needs dimensionality reduction, commonly via truncated Singular Value Decomposition (SVD), before the vectors are practical for downstream modeling.
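The SVD step can be sketched on the small example matrix from earlier. This uses NumPy's dense SVD for clarity; for a real vocabulary you would use a sparse representation (e.g. `scipy.sparse`) with a truncated solver instead:

```python
import numpy as np

# The 5x5 window-1 co-occurrence matrix from the example above.
M = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# Full SVD factorizes M into U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(M)

# Keeping only the top-k singular values/vectors yields dense,
# low-dimensional word vectors -- the core idea behind LSA.
k = 2
word_vectors = U[:, :k] * S[:k]   # shape: (vocab_size, k)
```

Each row of `word_vectors` is now a dense 2-dimensional embedding for one vocabulary word, instead of a sparse 5-dimensional (or, in practice, 100,000-dimensional) count vector.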