Discourse Analysis
Discourse relations, coreference resolution, and text coherence.
Discourse Analysis Overview
All the NLP techniques we've studied so far operate primarily at the sentence level. Discourse Analysis zooms out to study how sentences connect and relate to form coherent paragraphs, conversations, documents, and entire texts.
A sequence of grammatically perfect individual sentences does not automatically make a coherent text. Discourse analysis identifies the hidden logical glue that holds connected text together.
"Roses are red. Quantum mechanics describes particle physics. My cat is named Whiskers."
Three perfectly valid sentences. Zero discourse coherence. A good discourse model will assign this a very low coherence score.
Rhetorical Structure Theory (RST): The Backbone of Discourse
RST (Mann and Thompson, 1988) is one of the most influential theories for computational discourse analysis. It proposes that a coherent text can be represented as a hierarchical tree of nuclei and satellites linked by specific rhetorical relations.
Nucleus
The core element: the most essential piece of information. If it is removed, the text loses its main point. In the sentence "The system crashed [N] because of a memory overflow [S]", the Nucleus is the crash event.
Satellite
The supporting element: it elaborates on or fills in context around the nucleus. It supports the nucleus but is not itself the main point. In the same sentence, the cause ("memory overflow") is the Satellite.
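One way to picture the nucleus/satellite hierarchy is as a small recursive data structure. The sketch below is only an illustration: the class names `Span` and `RelationNode` and the helper `nucleus_text` are hypothetical and do not come from any RST toolkit.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Span:
    """A leaf of the tree: one stretch of text."""
    text: str


@dataclass
class RelationNode:
    """An internal node linking a nucleus to a satellite (hypothetical structure)."""
    relation: str
    nucleus: Union["Span", "RelationNode"]
    satellite: Union["Span", "RelationNode"]


def nucleus_text(node):
    """Follow nucleus links downward until a leaf is reached."""
    while isinstance(node, RelationNode):
        node = node.nucleus
    return node.text


# The example sentence from the definitions above.
tree = RelationNode(
    relation="CAUSE",
    nucleus=Span("The system crashed"),
    satellite=Span("because of a memory overflow"),
)
print(nucleus_text(tree))
```

Following only the nucleus links returns the part of the text that carries the main point, which is why nuclei are often treated as the material to keep when a text is shortened.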
Common Rhetorical (Discourse) Relations
| Relation | Meaning | Connecting Word Example |
|---|---|---|
| CAUSE | Satellite is the reason for the Nucleus event. | "because", "due to" |
| CONTRAST | Two nuclei are presented as opposing ideas. | "however", "but", "whereas" |
| ELABORATION | Satellite gives more detail about the Nucleus. | "specifically", "for example" |
| EVIDENCE | Satellite provides factual support for the Nucleus claim. | "as shown by", "data indicates" |
| CONCESSION | Satellite acknowledges something that seems to conflict with Nucleus. | "although", "even though" |
| CONDITION | Nucleus event is conditional upon the Satellite. | "if", "provided that", "unless" |
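The connecting words in the table can serve as a rough first signal for a relation. The following is a minimal sketch, assuming a simple whole-word lookup; the names `CUE_PHRASES` and `guess_relations` are hypothetical. Real discourse parsers rely on trained classifiers, because cue words are ambiguous and many relations have no explicit connective.

```python
import re

# Cue phrases copied from the relation table above.
CUE_PHRASES = {
    "CAUSE": ["because", "due to"],
    "CONTRAST": ["however", "but", "whereas"],
    "ELABORATION": ["specifically", "for example"],
    "EVIDENCE": ["as shown by", "data indicates"],
    "CONCESSION": ["although", "even though"],
    "CONDITION": ["if", "provided that", "unless"],
}


def guess_relations(text):
    """Return the relation labels whose cue phrases occur in `text` as whole words."""
    lowered = text.lower()
    found = []
    for relation, cues in CUE_PHRASES.items():
        for cue in cues:
            if re.search(r"\b" + re.escape(cue) + r"\b", lowered):
                found.append(relation)
                break  # one hit per relation is enough
    return found


print(guess_relations("The system crashed because of a memory overflow"))
```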
Discourse Segmentation: EDUs
The first step in computational discourse analysis is breaking text into the smallest possible meaning-bearing units called Elementary Discourse Units (EDUs). These are typically individual clauses.
Segmenting into EDUs
"The company's profits fell sharply last year, largely because they failed to innovate, and subsequently they had to lay off 500 employees."
Segmented into 3 EDUs:
- EDU 1 "The company's profits fell sharply last year,"
- EDU 2 "largely because they failed to innovate,"
- EDU 3 "and subsequently they had to lay off 500 employees."
Relations: EDU2 CAUSE → EDU1; EDU3 is the RESULT of EDU1.
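For this particular sentence, the segmentation above can be reproduced with a very naive rule: start a new unit after every comma. This is a sketch under that assumption only; the function name `segment_edus` is hypothetical, and real segmenters are trained models, since commas also separate list items and other material that is not a clause.

```python
import re

sentence = (
    "The company's profits fell sharply last year, "
    "largely because they failed to innovate, "
    "and subsequently they had to lay off 500 employees."
)


def segment_edus(text):
    """Split after each comma, keeping the comma with the preceding unit."""
    return re.split(r"(?<=,)\s+", text.strip())


for number, edu in enumerate(segment_edus(sentence), start=1):
    print(f"EDU {number}: {edu}")
```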
Coreference Resolution
Coreference Resolution Overview
Human writing is full of pronouns and abbreviated references. We constantly use words like "he", "she", "it", "they", "the company", and "the researcher" to refer back to entities already introduced in the text. Coreference Resolution is the task of finding all of these expressions that point to the same real-world entity and clustering them together.
Why it Matters Critically
Without coreference resolution, a machine reading the paragraph "Elon Musk founded Tesla. He later started SpaceX. The entrepreneur is now the richest person in the world." would treat "Elon Musk", "He", and "The entrepreneur" as three completely different people. Coreference resolution correctly merges them into a single entity cluster.
Key Terminology
Mention
Any noun phrase or pronoun in the text that could refer to an entity. Every "he", "she", "Amazon", "the company" is a candidate mention that needs to be resolved.
Antecedent
The earlier-occurring mention that a pronoun points back to. In "John ate his lunch", "John" is the antecedent of "his".
Coreference Chain
A complete cluster of all mentions that refer to the same entity.
Chain #1: {Elon Musk, He, The entrepreneur}.
A Worked Example
Input Text Analysis
"Amazon announced a new service today. The e-commerce giant said it will create 10,000 jobs. The company will begin hiring next quarter."
Resolved Coreference Chain
All 4 mentions ("Amazon", "The e-commerce giant", "it", "The company") resolve to the same entity: Amazon.
Modern Approach: Neural Mention-Ranking
Modern coreference systems typically use a neural model (e.g. one built on SpanBERT) that scores candidate pairs of mentions in a document to determine which ones are most likely to corefer. For each mention, the model ranks the candidate antecedents and picks the best-scoring one. The example below uses neuralcoref, a spaCy extension that resolves coreference clusters with a neural network.
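import spacy
import neuralcoref  # pip install neuralcoref (targets spaCy 2.x; not compatible with spaCy 3+)

# Load a small English pipeline and attach the coreference component
nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)

text = "Amazon announced a new service. The company said it will create 10,000 jobs."
doc = nlp(text)

if doc._.has_coref:
    print("Coreference Clusters Found:")
    for cluster in doc._.coref_clusters:
        # cluster.mentions holds the spans in the chain; cluster.main is the representative mention
        print(f"  Cluster: {[str(m) for m in cluster.mentions]}")
        print(f"  Main: '{cluster.main}'")

# Expected output (may vary with the model version):
# Coreference Clusters Found:
#   Cluster: ['Amazon', 'The company', 'it']
#   Main: 'Amazon'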
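The ranking step itself can be sketched without any trained model. In the toy example below, the pairwise scores are hand-picked, hypothetical numbers standing in for the output of a neural scorer, and `resolve` is a hypothetical helper. Each mention links to its highest-scoring earlier mention, or starts a new cluster when no candidate beats the "no antecedent" score of 0.

```python
mentions = ["Amazon", "a new service", "The company", "it"]

# Hypothetical scores: (mention index, candidate antecedent index) -> score.
# A real system would compute these with a trained model.
pair_scores = {
    (1, 0): -2.0,
    (2, 0): 2.5,   # "The company" -> "Amazon"
    (2, 1): -1.0,
    (3, 0): 1.2,
    (3, 1): -0.5,
    (3, 2): 1.8,   # "it" -> "The company"
}


def resolve(mentions, pair_scores):
    """Greedy mention ranking: link each mention to its best-scoring antecedent."""
    cluster_of = {}  # mention index -> cluster index
    clusters = []    # list of lists of mention indices
    for i in range(len(mentions)):
        best_j, best_score = None, 0.0  # 0.0 is the score of "no antecedent"
        for j in range(i):
            score = pair_scores.get((i, j), float("-inf"))
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            cluster_of[i] = len(clusters)
            clusters.append([i])
        else:
            cluster_of[i] = cluster_of[best_j]
            clusters[cluster_of[i]].append(i)
    return [[mentions[i] for i in cluster] for cluster in clusters]


print(resolve(mentions, pair_scores))
```

Because "it" links to "The company", which already links to "Amazon", all three land in one chain even though "it" never chose "Amazon" directly.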
Text Coherence
Understanding Text Coherence
While cohesion deals with the grammatical and lexical ties between individual sentences (e.g. pronouns), coherence refers to the logical and semantic flow of the entire document. A coherent text feels like a unified whole where every sentence contributes meaningfully to the overall theme.
Local Cohesion
How adjacent sentences connect using reference (pronouns), substitution, ellipsis, and conjunctions. It's the "surface level" connectivity.
Global Coherence
The high-level organization. Does the text follow a logical progression (e.g. Chronological, Problem-Solution, General-to-Specific)?
Computational Models of Coherence
Measuring coherence automatically is vital for tasks like automated essay grading and summarization evaluation. Key computational models include:
The Entity Grid Model
Introduced by Barzilay and Lapata (2008), this model represents a document as a grid where rows are sentences and columns are entities. Each cell records the entity's grammatical role in that sentence: Subject, Object, Other, or absent (-).
| Sentence | Elon Musk | Tesla | SpaceX |
|---|---|---|---|
| S1 | Subject | Object | - |
| S2 | Subject | - | Object |
| S3 | Subject | - | - |
A coherent text will show patterns of entity transitions (e.g., Subject → Subject) that are statistically likely in well-written documents.
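The transition patterns can be counted directly from the grid. This is a minimal sketch of that counting step only, using the grid from the table above with "S", "O", and "-" as shorthand for Subject, Object, and absent; the function name `transition_probs` is hypothetical. A full entity-grid model would feed such proportions into a trained ranker.

```python
from collections import Counter

# One column per entity, one role per sentence (S1, S2, S3).
grid = {
    "Elon Musk": ["S", "S", "S"],
    "Tesla": ["O", "-", "-"],
    "SpaceX": ["-", "O", "-"],
}


def transition_probs(grid):
    """Proportion of each length-2 role transition over all entity columns."""
    counts = Counter()
    for roles in grid.values():
        counts.update(zip(roles, roles[1:]))  # adjacent-sentence role pairs
    total = sum(counts.values())
    return {transition: count / total for transition, count in counts.items()}


for transition, prob in transition_probs(grid).items():
    print(transition, round(prob, 3))
```

With three entities and three sentences there are six transitions in total, two of which are Subject → Subject.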
Centering Theory
Centering Theory (Grosz, Joshi & Weinstein, 1995) tracks the most prominent entity that is "in focus" as the reader moves from sentence to sentence. It predicts that coherence is high when:
- The "center" (main entity) of a sentence is also the subject of the following sentence.
- The center changes as little as possible between adjacent sentences.
Sentence 2: "It ran across the park." → Center = Dog ✅ (High Coherence!)
Sentence 2': "The cat slept on the sofa." → Center shifts to Cat ❌ (Low Coherence)
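The two conditions above can be approximated with a much-simplified check. The sketch below assumes each sentence has already been reduced to a list of entities ordered by prominence (subject first). The entity lists are hypothetical toy input, and the helpers `backward_center` and `keeps_center` are illustrative names; neither is a full implementation of Centering Theory.

```python
def backward_center(prev_entities, curr_entities):
    """Most prominent entity of the previous sentence that reappears in the current one."""
    for entity in prev_entities:  # prev_entities is ordered by prominence
        if entity in curr_entities:
            return entity
    return None


def keeps_center(prev_entities, curr_entities):
    """True when the previous top entity is carried over as the current top entity."""
    center = backward_center(prev_entities, curr_entities)
    return (
        center is not None
        and center == prev_entities[0]
        and center == curr_entities[0]
    )


# Hypothetical entity lists, most prominent entity first.
print(keeps_center(["dog", "ball"], ["dog", "park"]))  # the center is kept
print(keeps_center(["dog", "ball"], ["cat", "sofa"]))  # the center shifts
```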
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Perplexity under a pretrained language model, used here as a rough proxy for coherence.
# Load GPT-2
model_id = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_id)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
model.eval()

def perplexity(text):
    """Return exp(mean token-level cross-entropy) of `text` under GPT-2."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# Coherent paragraph
coherent = "The moon is Earth's only natural satellite. It formed 4.5 billion years ago. Scientists believe it resulted from a giant impact."
# Shuffled (incoherent) paragraph
incoherent = "Scientists believe it resulted from a giant impact. The moon is Earth's only natural satellite. It formed 4.5 billion years ago."

print(f"Coherent Perplexity: {perplexity(coherent):.2f} (LOWER = MORE COHERENT)")
print(f"Incoherent Perplexity: {perplexity(incoherent):.2f} (HIGHER = LESS COHERENT)")

# Illustrative output (exact values depend on the model and library versions):
# Coherent Perplexity: 43.21 (LOWER = MORE COHERENT)
# Incoherent Perplexity: 89.56 (HIGHER = LESS COHERENT)