NLP Tutorial

Semantic Analysis

Semantic similarity, word sense disambiguation, and frame semantics.

Semantic Similarity

Semantic similarity measures how close two pieces of text are in meaning, rather than how many strings or characters they share verbatim.

Common Similarity Metrics

1. Jaccard Similarity

A set-overlap metric: the size of the intersection divided by the size of the union, |A ∩ B| / |A ∪ B|. It ranges from 0 (no shared items) to 1 (identical sets).

2. Cosine Similarity

Measures the cosine of the angle between two vectors, so it compares direction and ignores magnitude. It is the standard choice for comparing embeddings.

3. WordNet Path

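Based on the number of steps along the shortest path between two word senses in WordNet's hypernym (is-a) hierarchy: the fewer the steps, the more similar the senses.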
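As a minimal sketch of the Jaccard metric from the list above, the following function treats each text as a set of lowercase, whitespace-separated tokens. The function name, the tokenization, and the choice to return 0.0 for two empty texts are assumptions made for illustration.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity over lowercase, whitespace-separated tokens."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # convention chosen here for two empty texts
    return len(tokens_a & tokens_b) / len(union)


# Shared tokens: {"the", "cat", "on"}; union has 7 distinct tokens.
score = jaccard_similarity("the cat sat on the mat", "the cat lay on the rug")
print(round(score, 4))
```

Because it only counts shared tokens, this metric cannot tell that "mat" and "rug" are related, which is the gap the embedding-based and taxonomy-based metrics aim to close.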

Cosine Similarity with Scikit-Learn
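from sklearn.metrics.pairwise import cosine_similarity
# Two example vectors (illustrative values only)
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.5]
# Inputs must be 2D (one row per vector); the result is a 1x1 matrix
similarity = cosine_similarity([vec_a], [vec_b])[0][0]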
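The WordNet path metric listed above can be sketched without WordNet itself. The example below uses a tiny hand-written is-a taxonomy (hypothetical, not real WordNet data) and scores two concepts as 1 / (1 + shortest path length), which is the same form NLTK's `path_similarity` uses. All names here are illustrative.

```python
# Hypothetical mini-taxonomy: child -> parent (is-a). Not real WordNet data.
PARENT = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def ancestors(node):
    """Return the chain [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_similarity(a, b):
    """1 / (1 + edges on the shortest path via the lowest common ancestor)."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    depth_in_b = {node: steps for steps, node in enumerate(chain_b)}
    for steps_a, node in enumerate(chain_a):
        if node in depth_in_b:
            return 1.0 / (1 + steps_a + depth_in_b[node])
    return 0.0  # no common ancestor in this taxonomy


print(path_similarity("dog", "wolf"))  # 2 edges apart
print(path_similarity("dog", "cat"))   # 4 edges apart
```

In a tree, the first shared node found while walking up from one concept is the lowest common ancestor, so the sum of the two step counts is the shortest path length.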

Word Sense Disambiguation

Polysemy, the capacity for a single word to have multiple distinct meanings (senses), is a long-standing difficulty in computational linguistics. Word Sense Disambiguation (WSD) is the task of determining which sense of a word is being used in a given sentence.

The Ambiguity Framework

Target Word: Bass

Sense 1: Music Context

"He played the bass guitar."

Sense 2: Animal Context

"I caught a massive bass while fishing."

The Lesk Algorithm (Knowledge-Based WSD)

Introduced by Michael Lesk in 1986, this is the classic dictionary-based algorithm for WSD. Lesk's original version used machine-readable dictionary definitions; modern implementations typically draw their definitions (glosses) from WordNet. The core idea is simple: count the words shared between the sentence context and the dictionary definition of each sense, then choose the sense with the largest overlap.

How Lesk Calculates

Given the sentence: "We had grilled pine cones for dessert."

  1. Fetch Dictionary Definitions for "pine":
    • Sense A: "A kind of evergreen tree with needle-shaped leaves and cones."
    • Sense B: "To waste away through sorrow or illness."
  2. Define the Sentence Context: drop the target word and the stop words ("we", "had", "for"), leaving Context = {"grilled", "cones", "dessert"}
  3. Calculate Intersection Overlap: compare the Context to each definition. Sense A overlaps on the word "cones" (Score = 1). Sense B has no overlap (Score = 0). The algorithm therefore selects Sense A, the tree sense.
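The three steps above can be sketched as a simplified Lesk in plain Python. This is an illustration under stated assumptions, not Lesk's original implementation: the stop-word list, the tokenizer, and the function names are made up for this example, and the two glosses are the ones quoted above rather than entries fetched from a real dictionary.

```python
# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "of", "to", "we", "had", "for",
              "with", "and", "or", "through", "kind"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return {word.strip('.,!?"').lower() for word in text.split()} - {""}


def simplified_lesk(target, sentence, sense_glosses):
    """Return (best_sense, scores) by counting context/gloss word overlap."""
    context = tokenize(sentence) - STOP_WORDS - {target}
    scores = {
        sense: len(context & (tokenize(gloss) - STOP_WORDS))
        for sense, gloss in sense_glosses.items()
    }
    # On a tie, max() keeps the first sense listed.
    best = max(scores, key=scores.get)
    return best, scores


glosses = {
    "Sense A": "A kind of evergreen tree with needle-shaped leaves and cones.",
    "Sense B": "To waste away through sorrow or illness.",
}
best, scores = simplified_lesk("pine", "We had grilled pine cones for dessert.", glosses)
print(best, scores)
```

Note that a tie (for example, every sense scoring 0) falls back to whichever sense is listed first, which is one reason practical variants order senses by frequency.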

Modern NLP WSD

The Lesk algorithm is brittle because dictionary definitions are short: overlap scores are low, and often 0 for every candidate sense, which leaves nothing to choose between.

Deep contextualized embeddings (such as ELMo and BERT) substantially improved WSD accuracy. Because they compute a separate embedding for each occurrence of a word from its surrounding context, the vector for "bass" in a fishing sentence lies far from the vector for "bass" in a guitar sentence. A common approach is to embed sense-annotated example sentences and then classify a new occurrence with a k-nearest-neighbors classifier over that embedding space.
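The nearest-neighbor step can be sketched as follows. The 3-dimensional vectors are made-up stand-ins for contextual embeddings of "bass" (a real model such as BERT would produce vectors with hundreds of dimensions), and the function names and sense labels are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings of "bass" from sense-labeled sentences (not model output).
LABELED = [
    ([0.9, 0.1, 0.0], "music"),
    ([0.8, 0.2, 0.1], "music"),
    ([0.1, 0.9, 0.2], "fish"),
    ([0.0, 0.8, 0.3], "fish"),
]


def knn_sense(query, labeled, k=3):
    """Majority vote among the k labeled vectors most similar to the query."""
    ranked = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = {}
    for _, sense in ranked[:k]:
        votes[sense] = votes.get(sense, 0) + 1
    return max(votes, key=votes.get)


# Stand-in for the embedding of "bass" in a new fishing sentence.
predicted = knn_sense([0.2, 0.7, 0.1], LABELED)
print(predicted)
```

In practice the labeled vectors would come from running a contextual model over a sense-annotated corpus, and the query vector from the same model applied to the new sentence.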

Frame Semantics & FrameNet

Frame Semantics is a rich theory of meaning developed by linguist Charles J. Fillmore. It proposes that to truly understand the meaning of a word, you must understand the entire cognitive background scenario (or "frame") that the word evokes.

The Core Insight

You cannot understand the word "to buy" in isolation. It only makes sense when you understand the entire commercial transaction scenario, which includes: a Buyer, a Seller, Goods, a Price, and a Place of Purchase.

Example: The Commerce_buy Frame

Sentence: "Mary bought a laptop from the store for $1200."

Frame Element (Role)    Value in Sentence
Buyer                   Mary
Goods                   a laptop
Seller                  the store
Money (Price)           $1200
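One way to hold such an annotation in code is a small record that maps each frame element to the text span that fills it. The sketch below is an assumed representation for illustration, not FrameNet's own data format; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FrameAnnotation:
    frame: str       # name of the evoked frame
    sentence: str    # the annotated sentence
    elements: dict   # frame element name -> text span filling that role

    def missing_spans(self):
        """Return frame elements whose span does not occur in the sentence."""
        return [fe for fe, span in self.elements.items()
                if span not in self.sentence]


annotation = FrameAnnotation(
    frame="Commerce_buy",
    sentence="Mary bought a laptop from the store for $1200.",
    elements={
        "Buyer": "Mary",
        "Goods": "a laptop",
        "Seller": "the store",
        "Money": "$1200",
    },
)
print(annotation.missing_spans())
```

The check is a simple sanity test that every role value is copied verbatim from the sentence, which is how span-based frame annotations are usually expected to behave.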

Exploring FrameNet with NLTK

NLTK FrameNet Explorer
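import nltk
from nltk.corpus import framenet as fn

# One-time download of the FrameNet 1.7 data (skipped if already present)
nltk.download('framenet_v17', quiet=True)

# Look up a specific frame by name
commerce_frame = fn.frame('Commerce_buy')
print(f"Frame: {commerce_frame.name}")
print(f"Description: {commerce_frame.definition}")
# Frame elements (roles) defined for this frame, such as Buyer, Goods, Seller
print(f"Frame elements: {sorted(commerce_frame.FE)}")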