Discourse Analysis
Go beyond individual sentences: understand how paragraphs and multi-sentence texts create structured meaning through discourse relations.
Discourse Analysis Overview
All the NLP techniques we've studied so far operate primarily at the sentence level. Discourse analysis zooms out to study how sentences connect and relate to form coherent paragraphs, conversations, and documents.
A sequence of grammatically perfect individual sentences does not automatically make a coherent text. Discourse analysis identifies the hidden logical glue that holds connected text together.
"Roses are red. Quantum mechanics describes particle physics. My cat is named Whiskers."
Three perfectly valid sentences. Zero discourse coherence. A good discourse model will assign this a very low coherence score.
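To make the idea of a coherence score concrete, here is a deliberately crude sketch. It is not a discourse model: it only measures lexical overlap between adjacent sentences, a swapped-in proxy chosen because it fits in a few lines. The function names and the tiny stopword list are assumptions made for illustration.

```python
import re

# Hypothetical, very small stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "my", "of", "to", "and", "it"}

def content_words(sentence):
    """Lowercase the sentence and keep alphabetic tokens that are not stopwords."""
    return {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS}

def lexical_coherence(sentences):
    """Mean Jaccard overlap of content words between adjacent sentences."""
    scores = []
    for left, right in zip(sentences, sentences[1:]):
        a, b = content_words(left), content_words(right)
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

incoherent = [
    "Roses are red.",
    "Quantum mechanics describes particle physics.",
    "My cat is named Whiskers.",
]
print(lexical_coherence(incoherent))  # 0.0: no adjacent sentences share a content word
```

Word overlap is only a rough signal. Two sentences can be tightly connected by a cause or a contrast without sharing a single word, which is the kind of link the discourse relations in the rest of this section are meant to capture.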
Rhetorical Structure Theory (RST): The Backbone of Discourse
RST is one of the most influential theories in computational discourse analysis. It proposes that a coherent text can be represented as a hierarchical tree in which text spans, each acting as a nucleus or a satellite, are linked by specific rhetorical relations.
Nucleus
The core element: the most essential piece of information. If removed, the text loses its main point. In the sentence "The system crashed [N] because of a memory overflow [S]", the Nucleus is the crash event.
Satellite
The supporting element: it elaborates on the nucleus or fills in context around it. It supports the nucleus but is not itself the main point. In the same example, the cause ("because of a memory overflow") is the Satellite.
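The nucleus/satellite tree can be sketched as a small recursive data structure. This is a minimal illustration, not a standard RST library: the class names `EDU` and `Relation` and the helper `nucleus_text` are assumptions, and multi-nuclear relations (two nuclei, no satellite) are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class EDU:
    """A leaf of the tree: one span of text."""
    text: str

@dataclass
class Relation:
    """An internal node linking a nucleus and a satellite by a named relation."""
    name: str
    nucleus: "Node"
    satellite: "Node"

Node = Union[EDU, Relation]

def nucleus_text(node):
    """Follow nucleus links down to a leaf: the span carrying the main point."""
    while isinstance(node, Relation):
        node = node.nucleus
    return node.text

crash = Relation(
    name="CAUSE",
    nucleus=EDU("The system crashed"),
    satellite=EDU("because of a memory overflow"),
)
print(nucleus_text(crash))  # The system crashed
```

Because a nucleus or satellite can itself be a `Relation`, the same two classes are enough to represent a whole document as a nested tree.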
Common Rhetorical (Discourse) Relations
| Relation | Meaning | Example Connectives |
|---|---|---|
| CAUSE | Satellite is the reason for the Nucleus event. | "because", "due to" |
| CONTRAST | Two nuclei are presented as opposing ideas. | "however", "but", "whereas" |
| ELABORATION | Satellite gives more detail about the Nucleus. | "specifically", "for example" |
| EVIDENCE | Satellite provides factual support for the Nucleus claim. | "as shown by", "data indicates" |
| CONCESSION | Satellite acknowledges something that seems to conflict with Nucleus. | "although", "even though" |
| CONDITION | Nucleus event is conditional upon the Satellite. | "if", "provided that", "unless" |
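The connectives in the table suggest a very simple baseline: look for a cue phrase and guess the relation it usually signals. The sketch below assumes exactly the cue phrases listed above; the names `CUE_PHRASES` and `guess_relations` are made up for this example. Treat it as a rough heuristic, since many connectives are ambiguous and many relations are not signaled by any connective at all.

```python
import re

# Cue phrases copied from the table above, keyed by relation.
CUE_PHRASES = {
    "CAUSE": ["because", "due to"],
    "CONTRAST": ["however", "but", "whereas"],
    "ELABORATION": ["specifically", "for example"],
    "EVIDENCE": ["as shown by", "data indicates"],
    "CONCESSION": ["although", "even though"],
    "CONDITION": ["if", "provided that", "unless"],
}

def guess_relations(text):
    """Return (relation, cue) pairs for every cue phrase found as a whole word or phrase."""
    lowered = text.lower()
    found = []
    for relation, cues in CUE_PHRASES.items():
        for cue in cues:
            # \b keeps "if" from matching inside words such as "specifically".
            if re.search(r"\b" + re.escape(cue) + r"\b", lowered):
                found.append((relation, cue))
    return found

print(guess_relations("The system crashed because of a memory overflow."))
# [('CAUSE', 'because')]
```

The function only says which relation a connective hints at. Deciding which span is the nucleus and which is the satellite still needs the surrounding structure.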
Discourse Segmentation: EDUs
The first step in computational discourse analysis is breaking text into Elementary Discourse Units (EDUs), the minimal spans that can take part in a discourse relation. These are typically individual clauses.
Segmenting into EDUs
"The company's profits fell sharply last year, largely because they failed to innovate, and subsequently they had to lay off 500 employees."
Segmented into 3 EDUs:
- EDU 1 "The company's profits fell sharply last year,"
- EDU 2 "largely because they failed to innovate,"
- EDU 3 "and subsequently they had to lay off 500 employees."
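Relations: EDU 2 is the CAUSE of EDU 1 (EDU 2 is the satellite, EDU 1 the nucleus); EDU 3 states a RESULT of EDU 1 (RESULT is a further relation, not listed in the table above).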
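The segmentation above can be approximated with a small rule: start a new EDU after a comma when the next words are a clause-opening marker. The sketch below is a toy under that assumption. The `MARKERS` list is hypothetical and hand-picked to cover this example, whereas practical segmenters are typically trained on annotated data.

```python
import re

# Hypothetical, deliberately short list of clause-opening markers.
MARKERS = ["largely because", "because", "and subsequently", "although", "but", "whereas", "if"]

# A boundary is the whitespace that follows a comma and precedes one of the markers.
BOUNDARY = re.compile(
    r"(?<=,)\s+(?=(?:" + "|".join(re.escape(m) for m in MARKERS) + r")\b)",
    re.IGNORECASE,
)

def segment_edus(sentence):
    """Split a sentence into EDU strings at the marker-based boundaries."""
    return [part.strip() for part in BOUNDARY.split(sentence) if part.strip()]

sentence = (
    "The company's profits fell sharply last year, largely because they failed "
    "to innovate, and subsequently they had to lay off 500 employees."
)
for number, edu in enumerate(segment_edus(sentence), start=1):
    print(f"EDU {number}: {edu}")
```

On the example sentence this reproduces the three EDUs listed above. Sentences whose clause boundaries are not marked by a comma plus one of these markers come back as a single unit, which is the main limitation of a rule this small.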