Discourse Analysis

Discourse Analysis Overview

All the NLP techniques we've studied so far operate primarily at the sentence level. Discourse Analysis zooms out to study how sentences connect and relate to form coherent paragraphs, conversations, documents, and entire texts.

A sequence of grammatically perfect individual sentences does not automatically make a coherent text. Discourse analysis identifies the hidden logical glue that holds connected text together.

An Example of Discourse Breakdown:
"Roses are red. Quantum mechanics describes particle physics. My cat is named Whiskers."
Three perfectly valid sentences. Zero discourse coherence. A good discourse model will assign this a very low coherence score.

Rhetorical Structure Theory (RST): The Backbone of Discourse

RST is the most influential theory for computational discourse analysis. It proposes that coherent texts can be represented as a hierarchical tree of nuclei and satellites linked by specific rhetorical relations.

Nucleus

The core element: the most essential piece of information. If removed, the text loses its main point. In the sentence "The system crashed [N] because of a memory overflow [S]", the Nucleus is the crash event.

Satellite

The supporting element: it elaborates or fills in context around the nucleus. It helps the nucleus but is not itself the main point. The cause ("memory overflow") is the Satellite.
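One way to picture the nucleus/satellite split is as a small record with two text spans and a relation label. The sketch below is only an illustration: the class name `RSTRelation` and its fields are hypothetical, not part of any RST toolkit.

```python
from dataclasses import dataclass


@dataclass
class RSTRelation:
    """One nucleus-satellite link (hypothetical class, for illustration only)."""
    relation: str   # rhetorical relation label, e.g. "CAUSE"
    nucleus: str    # the span that carries the main point
    satellite: str  # the span that supports the nucleus

    def summary(self) -> str:
        # Dropping the satellite should leave the main point intact.
        return self.nucleus


crash = RSTRelation(
    relation="CAUSE",
    nucleus="The system crashed",
    satellite="because of a memory overflow",
)
print(crash.summary())  # The system crashed
```

Keeping only the nucleus is the same intuition that RST-based summarizers rely on: satellites can usually be removed without losing the core message.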

Common Rhetorical (Discourse) Relations

Relation	Meaning	Connecting Word Example
CAUSE	Satellite is the reason for the Nucleus event.	"because", "due to"
CONTRAST	Two nuclei are presented as opposing ideas.	"however", "but", "whereas"
ELABORATION	Satellite gives more detail about the Nucleus.	"specifically", "for example"
EVIDENCE	Satellite provides factual support for the Nucleus claim.	"as shown by", "data indicates"
CONCESSION	Satellite acknowledges something that seems to conflict with Nucleus.	"although", "even though"
CONDITION	Nucleus event is conditional upon the Satellite.	"if", "provided that", "unless"
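The connecting words in the table can be turned into a very rough relation guesser. This is a minimal sketch, assuming a plain keyword lookup; the function name `guess_relations` is hypothetical, and real discourse parsers use trained classifiers because cue words are ambiguous ("if" and "but" appear in many non-discourse uses).

```python
import re

# Cue phrases copied from the table above.
CUE_PHRASES = {
    "CAUSE": ["because", "due to"],
    "CONTRAST": ["however", "but", "whereas"],
    "ELABORATION": ["specifically", "for example"],
    "EVIDENCE": ["as shown by", "data indicates"],
    "CONCESSION": ["although", "even though"],
    "CONDITION": ["if", "provided that", "unless"],
}


def guess_relations(text):
    """Return the relations whose cue phrases occur in the text (keyword match only)."""
    lowered = text.lower()
    found = []
    for relation, cues in CUE_PHRASES.items():
        for cue in cues:
            # \b keeps "if" from matching inside words such as "shift".
            if re.search(rf"\b{re.escape(cue)}\b", lowered):
                found.append(relation)
                break
    return found


print(guess_relations("The system crashed because of a memory overflow."))  # ['CAUSE']
```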

Discourse Segmentation: EDUs

The first step in computational discourse analysis is breaking text into the smallest possible meaning-bearing units called Elementary Discourse Units (EDUs). These are typically individual clauses.

Segmenting into EDUs

"The company's profits fell sharply last year, largely because they failed to innovate, and subsequently they had to lay off 500 employees."

Segmented into 3 EDUs:

EDU 1 "The company's profits fell sharply last year,"
EDU 2 "largely because they failed to innovate,"
EDU 3 "and subsequently they had to lay off 500 employees."

Relations: EDU 2 is the CAUSE of EDU 1; EDU 3 is the RESULT of EDU 1.
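For this particular sentence, the three EDUs happen to line up with the commas, so a naive splitter is enough to reproduce the segmentation. The sketch below assumes exactly that and nothing more: `naive_edu_split` is a hypothetical helper, and splitting on every comma would over-segment lists and appositives in real text, which is why practical segmenters are trained models.

```python
import re


def naive_edu_split(sentence):
    """Split after each comma that is followed by whitespace (toy heuristic)."""
    return [part for part in re.split(r"(?<=,)\s+", sentence) if part]


sentence = (
    "The company's profits fell sharply last year, "
    "largely because they failed to innovate, "
    "and subsequently they had to lay off 500 employees."
)

for number, edu in enumerate(naive_edu_split(sentence), start=1):
    print(f"EDU {number}: {edu}")
```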

Coreference Resolution

Coreference Resolution Overview

Human writing is full of pronouns and abbreviated references. We constantly use words like "he", "she", "it", "they", "the company", and "the researcher" to refer back to entities already introduced in the text. Coreference Resolution is the task of finding all of these expressions that point to the same real-world entity and clustering them together.

Why it Matters Critically

Without coreference resolution, a machine reading the paragraph "Elon Musk founded Tesla. He later started SpaceX. The entrepreneur is now the richest person in the world." would treat "Elon Musk", "He", and "The entrepreneur" as three completely different people. Coreference resolution correctly merges them into a single entity cluster.

Key Terminology

Mention

Any noun phrase or pronoun in the text that could refer to an entity. Every "he", "she", "Amazon", "the company" is a candidate mention that needs to be resolved.

Antecedent

The earlier-occurring mention that a pronoun points back to. In "John ate his lunch", "John" is the antecedent of "his".

Coreference Chain

A complete cluster of all mentions that refer to the same entity.
Chain #1: {Elon Musk, He, The entrepreneur}.

A Worked Example

Input Text Analysis

"Amazon announced a new service today. The e-commerce giant said it will create 10,000 jobs. The company will begin hiring next quarter."

Resolved Coreference Chain

Chain #1: {Amazon, The e-commerce giant, it, The company}

All 4 mentions correctly resolve to the same entity: Amazon.
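If a system outputs pairwise decisions ("mention A corefers with mention B"), the chains can be recovered by merging linked mentions into groups. Below is a minimal union-find sketch under two assumptions: the pairwise links are already given (here they are written by hand), and each mention string is unique, whereas a real system would key on text spans. The function name `build_chains` is hypothetical.

```python
def build_chains(mentions, links):
    """Group mentions into coreference chains from pairwise links (union-find)."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the representative, compressing the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    chains = {}
    for m in mentions:
        chains.setdefault(find(m), []).append(m)
    return list(chains.values())


mentions = ["Amazon", "a new service", "The e-commerce giant",
            "it", "10,000 jobs", "The company"]
# Hand-written links standing in for a model's pairwise decisions.
links = [("Amazon", "The e-commerce giant"),
         ("The e-commerce giant", "it"),
         ("it", "The company")]

for chain in build_chains(mentions, links):
    print(chain)
```

Mentions that are never linked ("a new service", "10,000 jobs") end up as singleton chains.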

Modern Approach: Neural Mention-Ranking

Modern neural coreference systems typically use a pretrained encoder (e.g. SpanBERT) to represent candidate mention spans, then score each mention against the mentions that precede it in the document. For each mention, the model ranks the candidate antecedents and links the mention to the best-scoring one, or to no antecedent at all if none scores well.
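The ranking step can be illustrated without a neural network. In the sketch below, `score_pair` is a toy, hand-written stand-in for the learned scorer, and its features (number agreement, entity type, distance) and weights are assumptions chosen for illustration. The "no antecedent" option is given a fixed score of 0, which is also an assumption of this sketch.

```python
def score_pair(mention, candidate, distance):
    """Toy stand-in for a learned pairwise scorer (hypothetical features and weights)."""
    score = 0.0
    if mention["number"] == candidate["number"]:
        score += 1.0          # singular/plural agreement
    if mention["type"] == candidate["type"]:
        score += 1.0          # same coarse entity type
    score -= 0.1 * distance   # prefer closer antecedents
    return score


def best_antecedent(mention_index, mentions):
    """Rank all earlier mentions and return the best one, or None."""
    mention = mentions[mention_index]
    best, best_score = None, 0.0  # 0.0 = score of linking to no antecedent
    for i in range(mention_index):
        s = score_pair(mention, mentions[i], mention_index - i)
        if s > best_score:
            best, best_score = mentions[i]["text"], s
    return best


mentions = [
    {"text": "Amazon", "number": "sg", "type": "org"},
    {"text": "a new service", "number": "sg", "type": "thing"},
    {"text": "The company", "number": "sg", "type": "org"},
]

print(best_antecedent(2, mentions))  # Amazon
```

"Amazon" wins over the closer "a new service" because the type match outweighs the small distance penalty.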

Coreference Resolution with spaCy + neuralcoref

import spacy
import neuralcoref  # pip install neuralcoref (built for spaCy 2.x; not compatible with spaCy 3+)

nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)

text = "Amazon announced a new service. The company said it will create 10,000 jobs."
doc = nlp(text)

if doc._.has_coref:
    print("Coreference Clusters Found:")
    for cluster in doc._.coref_clusters:
        print(f"  Cluster: {[str(m) for m in cluster.mentions]}")
        print(f"  Main: '{cluster.main}'")

# Expected output (may vary with the model version):
# Coreference Clusters Found:
#   Cluster: ['Amazon', 'The company', 'it']
#   Main: 'Amazon'

Text Coherence

Understanding Text Coherence

While cohesion deals with the grammatical and lexical ties between individual sentences (e.g. pronouns), coherence refers to the logical and semantic flow of the entire document. A coherent text feels like a unified whole where every sentence contributes meaningfully to the overall theme.

Local Cohesion

How adjacent sentences connect using reference (pronouns), substitution, ellipsis, and conjunctions. It's the "surface level" connectivity.

Global Coherence

The high-level organization. Does the text follow a logical progression (e.g. Chronological, Problem-Solution, General-to-Specific)?

Computational Models of Coherence

Measuring coherence automatically is vital for tasks like automated essay grading and summarization evaluation. Key computational models include:

The Entity Grid Model

Introduced by Barzilay and Lapata (2008), this model represents a document as a grid where rows are sentences and columns are entities. Each cell records the grammatical role of an entity in that sentence: subject, object, other, or absent (shown as "-" below).

Sentence	Elon Musk	Tesla	SpaceX
S1	Subject	Object	-
S2	Subject	-	Object
S3	Subject	-	-

A coherent text will show patterns of entity transitions (e.g., Subject → Subject) that are statistically likely in well-written documents.
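The transition statistics can be computed directly from the grid. The sketch below encodes the example grid by hand (S = subject, O = object, "-" = absent) and counts how often each role-to-role transition occurs between adjacent sentences. The function name `transition_probs` is hypothetical; in the original model, probabilities like these serve as features for a learned ranker rather than as a final score.

```python
from collections import Counter

# The example grid, one column (entity) per key, one entry per sentence.
grid = {
    "Elon Musk": ["S", "S", "S"],
    "Tesla":     ["O", "-", "-"],
    "SpaceX":    ["-", "O", "-"],
}


def transition_probs(grid):
    """Relative frequency of each role transition between adjacent sentences."""
    counts = Counter()
    for roles in grid.values():
        for prev_role, next_role in zip(roles, roles[1:]):
            counts[(prev_role, next_role)] += 1
    total = sum(counts.values())
    return {transition: count / total for transition, count in counts.items()}


for transition, prob in transition_probs(grid).items():
    print(transition, round(prob, 3))
```

Here the Subject → Subject transition accounts for 2 of the 6 transitions, all of them from the "Elon Musk" column.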

Centering Theory

Centering Theory (Grosz, Joshi & Weinstein, 1995) tracks the most prominent entity that is "in focus" as the reader moves from sentence to sentence. It predicts that coherence is high when:

The "center" (main entity) of a sentence is also the subject of the following sentence.
The center changes as little as possible between adjacent sentences.

Sentence 1: "The dog chased the ball." → Center = Dog
Sentence 2: "It ran across the park." → Center = Dog ✅ (High Coherence!)
Sentence 2': "The cat slept on the sofa." → Center shifts to Cat ❌ (Low Coherence)
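The dog/cat example can be checked mechanically. The sketch below is a simplification that assumes coreference is already resolved ("It" has been mapped to "dog") and that each sentence comes with its entities ranked by salience, subject first. The function names are hypothetical; the transition labels follow those commonly used in the centering literature.

```python
def backward_center(prev_entities, cur_entities):
    """Highest-ranked entity of the previous sentence that also appears in the current one."""
    for entity in prev_entities:
        if entity in cur_entities:
            return entity
    return None


def transition(prev_center, prev_entities, cur_entities):
    """Classify the move from one sentence to the next."""
    center = backward_center(prev_entities, cur_entities)
    if center is None:
        return "no shared center"
    same_center = prev_center is None or center == prev_center
    center_is_most_salient = center == cur_entities[0]
    if same_center:
        return "continue" if center_is_most_salient else "retain"
    return "smooth shift" if center_is_most_salient else "rough shift"


s1 = ["dog", "ball"]      # "The dog chased the ball."
s2 = ["dog", "park"]      # "It ran across the park."  (It = dog)
s2_alt = ["cat", "sofa"]  # "The cat slept on the sofa."

print(transition(None, s1, s2))      # continue
print(transition(None, s1, s2_alt))  # no shared center
```

The first pair keeps the dog in focus, while the alternative second sentence shares no entity with the first, so there is nothing to carry the focus forward.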

BERT-based Models: Fine-tuned on synthetic incoherence tasks (e.g., "detect the one shuffled sentence in an otherwise clean paragraph").
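Training data for such a task can be generated from any clean paragraph. The sketch below is one possible recipe, not a specific published setup: with probability 0.5 it leaves the paragraph alone (label 1), otherwise it moves a single sentence to a different position (label 0). The function name `make_shuffled_example` is hypothetical, and the sentences are assumed to be distinct.

```python
import random


def make_shuffled_example(sentences, rng):
    """Return (sentences, label): label 1 = original order, 0 = one sentence moved."""
    if len(sentences) < 2 or rng.random() < 0.5:
        return list(sentences), 1
    corrupted = list(sentences)
    old_index = rng.randrange(len(corrupted))
    moved = corrupted.pop(old_index)
    # Choose a new position that differs from the original one.
    new_index = rng.choice([k for k in range(len(corrupted) + 1) if k != old_index])
    corrupted.insert(new_index, moved)
    return corrupted, 0


paragraph = [
    "The moon is Earth's only natural satellite.",
    "It formed 4.5 billion years ago.",
    "Scientists believe it resulted from a giant impact.",
]

rng = random.Random(0)  # fixed seed so runs are repeatable
example, label = make_shuffled_example(paragraph, rng)
print(label, example)
```

The resulting (text, label) pairs could then be fed to a standard sequence classifier for fine-tuning.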

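GPT Perplexity: A language model assigns a probability to the text. Low perplexity (high probability) is taken as a rough signal of high coherence, although perplexity also reflects word choice and topic familiarity, so it is only a proxy.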
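For reference, perplexity under its usual definition is the exponential of the average negative log-probability the model assigns to each token given the tokens before it. Here N denotes the number of tokens being predicted and p is the language model's distribution:

```latex
\mathrm{PPL}(x_1, \dots, x_N)
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\left(x_i \mid x_{<i}\right) \right)
```

A model that finds every token easy to predict has per-token probabilities close to 1, so the sum is close to 0 and the perplexity approaches its minimum of 1.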

Coherence Scoring via GPT-2 Perplexity

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load GPT-2
model_id = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_id)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
model.eval()

def perplexity(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# Coherent paragraph
coherent = "The moon is Earth's only natural satellite. It formed 4.5 billion years ago. Scientists believe it resulted from a giant impact."

# Shuffled (incoherent) paragraph
incoherent = "Scientists believe it resulted from a giant impact. The moon is Earth's only natural satellite. It formed 4.5 billion years ago."

print(f"Coherent Perplexity:   {perplexity(coherent):.2f}  (LOWER = MORE COHERENT)")
print(f"Incoherent Perplexity: {perplexity(incoherent):.2f}  (HIGHER = LESS COHERENT)")

# Example output (illustrative; exact values depend on the model and library versions):
# Coherent Perplexity:   43.21  (LOWER = MORE COHERENT)
# Incoherent Perplexity: 89.56  (HIGHER = LESS COHERENT)