Semantic Analysis
Semantic similarity, word sense disambiguation, and frame semantics.
Semantic Similarity
Semantic similarity measures how close two pieces of text are in meaning, rather than how many exact strings or characters they share.
Common Similarity Metrics
1. Jaccard Similarity
A set-overlap metric: the size of the intersection of two token sets divided by the size of their union, |A ∩ B| / |A ∪ B|.
2. Cosine Similarity
Measures the cosine of the angle between two vectors, ignoring their magnitudes. It is the most common choice for comparing embeddings.
3. WordNet Path
Based on the number of steps between two concepts in WordNet's hypernym (is-a) hierarchy: the shorter the path, the more similar the concepts.
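import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Example calculation between two vectors (toy values for illustration)
vec_a = np.array([1.0, 2.0, 3.0])
vec_b = np.array([2.0, 4.0, 6.5])
similarity = cosine_similarity([vec_a], [vec_b])[0, 0]  # result is a 1x1 matrix; take the scalar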
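The other two metrics in the list can be sketched in plain Python. This is a minimal illustration, not a reference implementation: `jaccard_similarity` assumes simple whitespace tokenization, and `TOY_HYPERNYMS` is a hypothetical hand-built hierarchy standing in for WordNet. The path score uses the formula 1 / (shortest path length + 1), which is how NLTK's `path_similarity` is defined.

```python
from collections import deque


def jaccard_similarity(text_a, text_b):
    """Jaccard similarity over lowercase whitespace tokens."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical child -> parent links; NOT real WordNet data.
TOY_HYPERNYMS = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def path_similarity(word_a, word_b, hypernyms=TOY_HYPERNYMS):
    """Return 1 / (shortest path length + 1), or None if a word is unknown."""
    # Treat the hierarchy as an undirected graph.
    neighbors = {}
    for child, parent in hypernyms.items():
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)
    if word_a not in neighbors or word_b not in neighbors:
        return None

    # Breadth-first search finds the shortest path in an unweighted graph.
    queue = deque([(word_a, 0)])
    seen = {word_a}
    while queue:
        node, dist = queue.popleft()
        if node == word_b:
            return 1.0 / (dist + 1)
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

In this toy hierarchy, "dog" and "wolf" are two steps apart (via "canine"), while "dog" and "cat" are four steps apart (via "mammal"), so the first pair scores higher.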
Word Sense Disambiguation
Polysemy, the capacity of a single word to have multiple distinct meanings (senses), is one of the long-standing hard problems in computational linguistics. Word Sense Disambiguation (WSD) is the task of determining automatically which sense of a word is being used in a given sentence.
The Ambiguity Framework
Target Word: Bass
Sense 1: Music Context
"He played the bass guitar."
Sense 2: Animal Context
"I caught a massive bass while fishing."
The Lesk Algorithm (Knowledge-Based WSD)
Created by Michael Lesk in 1986, this is the classic dictionary-based algorithm for WSD. It relies on dictionary definitions (glosses); modern implementations usually take them from WordNet. The core idea is simple: count the words shared between the sentence context and the dictionary definition of each sense, then pick the sense with the largest overlap.
How Lesk Calculates
Given the sentence: "We had grilled pine cones for dessert."
- Fetch dictionary definitions for "pine":
  - Sense A: "A kind of evergreen tree with needle-shaped leaves and cones."
  - Sense B: "To waste away through sorrow or illness."
- Define the sentence context: Context = {"grilled", "cones", "dessert"}
- Calculate the intersection overlap: compare the context to each definition. Sense A overlaps on the word "cones" (Score=1). Sense B has no overlap (Score=0). The algorithm correctly assigns Sense A.
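The steps above can be sketched as a simplified Lesk scorer. This is a minimal illustration under stated assumptions: the glosses are the two definitions quoted above (not fetched from WordNet), and `STOP_WORDS`, `tokenize`, and `simplified_lesk` are hypothetical helpers written for this example. The small stop-word list is chosen so that the context matches the one shown in the walkthrough.

```python
# Hypothetical stop-word list, just large enough for this example.
STOP_WORDS = {"a", "an", "the", "of", "to", "for", "with", "and", "or", "we", "had"}

PINE_SENTENCE = "We had grilled pine cones for dessert."
PINE_GLOSSES = {
    "Sense A": "A kind of evergreen tree with needle-shaped leaves and cones.",
    "Sense B": "To waste away through sorrow or illness.",
}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = {word.strip('.,!?"').lower() for word in text.split()}
    return tokens - {""}


def simplified_lesk(sentence, target, sense_glosses):
    """Return (best_sense, scores), scoring each sense by gloss/context overlap."""
    # Context = sentence words minus stop words and the target word itself.
    context = tokenize(sentence) - STOP_WORDS - {target.lower()}
    scores = {}
    for sense, gloss in sense_glosses.items():
        signature = tokenize(gloss) - STOP_WORDS
        scores[sense] = len(context & signature)
    # On a tie, max() keeps the first sense listed.
    best = max(scores, key=scores.get)
    return best, scores


best_sense, overlap_scores = simplified_lesk(PINE_SENTENCE, "pine", PINE_GLOSSES)
print(best_sense, overlap_scores)
```

Because ties fall back to the first listed sense, a sentence with no overlapping words would return "Sense A" by default, which is exactly the weakness of short glosses.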
Modern NLP WSD
The Lesk algorithm often fails because dictionary definitions are short, so overlap scores are low. Frequently every sense scores 0, and the algorithm has no basis for choosing between them.
Deep contextualized embeddings (such as ELMo and BERT) greatly improved WSD accuracy. Because they compute a separate vector for each occurrence of a word from its surrounding context, the vector for "bass" in a fishing sentence lies far from the vector for "bass" in a guitar sentence. One simple approach is to run a k-nearest-neighbors classifier over the embeddings of sense-labeled example sentences.
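The nearest-neighbor idea can be sketched without running a real model. In the example below, the 3-dimensional vectors in `LABELED` are hypothetical stand-ins for contextual embeddings of "bass" (a real system would obtain them from ELMo or BERT), and `knn_sense` is an illustrative helper, not a library function.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy "embeddings" of the word "bass" in labeled sentences.
LABELED = [
    ([0.9, 0.1, 0.0], "music"),
    ([0.8, 0.2, 0.1], "music"),
    ([0.7, 0.0, 0.2], "music"),
    ([0.1, 0.9, 0.1], "fish"),
    ([0.0, 0.8, 0.3], "fish"),
    ([0.2, 0.7, 0.0], "fish"),
]


def knn_sense(query, labeled=LABELED, k=3):
    """Predict a sense by majority vote among the k most similar labeled vectors."""
    ranked = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]


# A query vector that sits near the "fish" cluster.
print(knn_sense([0.1, 0.85, 0.2]))
```

The quality of this approach depends entirely on the labeled examples: a sense with no annotated sentences can never be predicted.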
Frame Semantics & FrameNet
Frame Semantics is a rich theory of meaning developed by linguist Charles J. Fillmore. It proposes that to truly understand the meaning of a word, you must understand the entire cognitive background scenario (or "frame") that the word evokes.
The Core Insight
You cannot understand the word "to buy" in isolation. It only makes sense when you understand the entire commercial transaction scenario, which includes: a Buyer, a Seller, Goods, a Price, and a Place of Purchase.
Example: The Commerce_buy Frame
Sentence: "Mary bought a laptop from the store for $1200."
| Frame Element Role | Value in Sentence |
|---|---|
| Buyer | Mary |
| Goods | a laptop |
| Seller | the store |
| Money (Price) | $1200 |
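One way to hold an annotation like the table above in code is a small record type. This is only a sketch: `FrameAnnotation` and its methods are hypothetical names invented for this example, and the list of required roles passed to `missing` is an assumption supplied by the caller, not something read from FrameNet.

```python
from dataclasses import dataclass, field


@dataclass
class FrameAnnotation:
    """A frame evoked by a target word, with role -> text-span fillers."""
    frame: str
    target: str
    elements: dict = field(default_factory=dict)

    def missing(self, required_roles):
        """Return the required roles that have no filler in this annotation."""
        return [role for role in required_roles if role not in self.elements]

    def spans_in(self, sentence):
        """Check that every filler is a literal substring of the sentence."""
        return all(span in sentence for span in self.elements.values())


SENTENCE = "Mary bought a laptop from the store for $1200."

annotation = FrameAnnotation(
    frame="Commerce_buy",
    target="bought",
    elements={
        "Buyer": "Mary",
        "Goods": "a laptop",
        "Seller": "the store",
        "Money": "$1200",
    },
)

# The sentence names no place of purchase, so that role stays unfilled.
print(annotation.missing(["Buyer", "Goods", "Place"]))
print(annotation.spans_in(SENTENCE))
```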
Exploring FrameNet with NLTK
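import nltk
from nltk.corpus import framenet as fn
# One-time download of the FrameNet 1.7 data that nltk.corpus.framenet reads
nltk.download('framenet_v17', quiet=True)
# Look up a specific frame by name
commerce_frame = fn.frame('Commerce_buy')
print(f"Frame: {commerce_frame.name}")
print(f"Description: {commerce_frame.definition}")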