Gensim Tutorial

Gensim

Open-source Python library for unsupervised topic modeling and statistical semantic vector space modeling.

Gensim: Topic Modeling for Humans

Gensim (short for "Generate Similar") is designed to process extremely large text collections using streamed, memory-independent algorithms, so the full dataset never has to fit into RAM. It is one of the best-known libraries for topic modeling (Latent Dirichlet Allocation) and static word embeddings (Word2Vec).
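That streaming design means a training corpus can be any restartable iterable of tokenized sentences, not just an in-memory list. A minimal sketch of such an iterable (the file name `medical_notes.txt` is a made-up example, and the whitespace tokenizer is deliberately naive):

```python
class StreamingCorpus:
    """Yields one tokenized sentence at a time, so the whole file
    never has to be loaded into memory. Because it is a class with
    __iter__ (not a one-shot generator), Gensim can iterate over it
    multiple times -- once to build the vocabulary, then once per
    training epoch."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                # Naive tokenization: lowercase + split on whitespace
                yield line.lower().split()

# Hypothetical usage with a large on-disk corpus:
# model = Word2Vec(sentences=StreamingCorpus("medical_notes.txt"), vector_size=50)
```

This is the idiomatic pattern for corpora too large to hold in RAM; for small datasets, a plain list of lists (as in the examples below) works just as well.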

Level 1 — Training Your Own Word2Vec Model

Pre-trained word embeddings from Google are great, but what if your company works in a heavily specialized medical or legal field? General embeddings won't understand your domain-specific jargon! Gensim allows you to easily train an embedding model strictly on your own documents.

Training Word2Vec from Scratch
from gensim.models import Word2Vec

# Step 1: Prepare your data as a list of lists of words (tokenized sentences)
# In reality, this would be a massive stream of millions of sentences from a medical journal.
corpus_sentences = [
    ['the', 'patient', 'presented', 'with', 'severe', 'cardiac', 'hypertrophy'],
    ['echocardiogram', 'showed', 'thickening', 'of', 'ventricular', 'walls'],
    ['prescribed', 'beta', 'blockers', 'to', 'reduce', 'cardiac', 'strain'],
    ['patient', 'responded', 'well', 'to', 'the', 'prescribed', 'medication']
]

# Step 2: Train the Model
# vector_size = dimensionality of the word vectors (100-300 is typical)
# window = how many adjacent words to consider as context
# min_count = ignore words that appear fewer than this many times
model = Word2Vec(sentences=corpus_sentences, vector_size=50, window=3, min_count=1, workers=4)

# Step 3: Extract the Mathematical Vectors
vector_for_cardiac = model.wv['cardiac']
print(f"Vector (first 5 dims): {vector_for_cardiac[:5]}")

# Step 4: Find Semantic Similarities within your Custom Domain!
similar_words = model.wv.most_similar('cardiac', topn=2)
print("Most similar to 'cardiac':", similar_words)
# e.g. [('hypertrophy', 0.1554), ('strain', 0.1203)] -> it learned medical context!
# (Exact words and scores will vary between runs on a corpus this tiny.)

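Under the hood, most_similar ranks words by the cosine similarity between their vectors. A minimal NumPy sketch of that computation, using toy 3-dimensional vectors rather than real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors, divided by the product of their lengths.
    # Result is 1.0 for identical directions, 0 for orthogonal, -1 for opposite.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors standing in for learned embeddings
v_cardiac = np.array([0.9, 0.1, 0.3])
v_strain  = np.array([0.8, 0.2, 0.25])
v_goal    = np.array([-0.5, 0.9, 0.0])

print(cosine_similarity(v_cardiac, v_strain))  # close to 1.0: similar direction
print(cosine_similarity(v_cardiac, v_goal))    # negative here: pointing away
```

Gensim's most_similar simply computes this score between the query vector and every other word vector in the vocabulary, then returns the top-n.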
Level 2 — Topic Modeling with LDA

Latent Dirichlet Allocation (LDA) automatically groups thousands of articles into abstract "topics" by statistically analyzing which words frequently co-occur in the same documents.

Topic Discovery using LDA
from gensim import corpora, models

# Suppose we have three documents: one about tech, one about finance, one about sports.
doc_texts = [
    ["apple", "iphone", "release", "software", "update", "battery"],
    ["federal", "reserve", "interest", "rate", "inflation", "market", "economy"],
    ["team", "coach", "football", "goal", "championship", "stadium"]
]

# 1. Create a dictionary representation of the documents (maps words to integer IDs)
dictionary = corpora.Dictionary(doc_texts)

# 2. Convert each document to a Bag-of-Words representation (a list of (token_id, token_count) tuples)
corpus = [dictionary.doc2bow(text) for text in doc_texts]

# 3. Train the LDA Model
# We ask it to find 3 distinct abstract topics hidden inside the corpus
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10, random_state=42)

# 4. View the topics it discovered!
print("The 3 discovered topics:\n" + "-" * 30)
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic: {idx} \nWords: {topic}\n")
    
# Topic: 0 -> 0.143*"federal" + 0.143*"reserve" + 0.143*"interest" + 0.143*"rate" + etc.
pyLDAvis Integration

Gensim pairs well with pyLDAvis, a library that renders the discovered topics as an interactive, D3.js-powered bubble chart, letting you visually explore topic sizes and overlaps in a web browser. (In recent pyLDAvis versions the Gensim bridge lives in the `pyLDAvis.gensim_models` module.)