Text Coherence

Explore what makes a text smooth and readable: from local cohesion (sentence-to-sentence) to global coherence (overall document structure).

Understanding Text Coherence

While cohesion deals with the grammatical and lexical ties between individual sentences (e.g. pronouns), coherence refers to the logical and semantic flow of the entire document. A coherent text feels like a unified whole where every sentence contributes meaningfully to the overall theme.

Local Cohesion

How adjacent sentences connect using reference (pronouns), substitution, ellipsis, and conjunctions. It's the "surface level" connectivity.
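These surface-level ties can be spotted with simple heuristics. The sketch below flags pronoun reference, explicit connectives, and lexical repetition between adjacent sentences; the word lists are illustrative assumptions, and a real system would use coreference resolution instead.

```python
import re

# Simplified heuristic for surface-level cohesive ties between adjacent
# sentences. The PRONOUNS and CONNECTIVES lists are illustrative, not
# exhaustive; a production system would use a coreference resolver.
PRONOUNS = {"it", "he", "she", "they", "this", "that", "these", "those"}
CONNECTIVES = {"however", "therefore", "moreover", "furthermore", "thus"}

def cohesive_ties(sentences):
    """For each adjacent sentence pair, report which cohesive devices appear."""
    ties = []
    for prev, curr in zip(sentences, sentences[1:]):
        words = set(re.findall(r"[a-z]+", curr.lower()))
        ties.append({
            "pronoun_reference": bool(words & PRONOUNS),
            "connective": bool(words & CONNECTIVES),
            # lexical cohesion: content words repeated from the previous sentence
            "lexical_overlap": bool(words & set(re.findall(r"[a-z]{4,}", prev.lower()))),
        })
    return ties

text = [
    "The dog chased the ball.",
    "It ran across the park.",
    "However, the ball rolled into a pond.",
]
print(cohesive_ties(text))
```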

Global Coherence

The high-level organization. Does the text follow a logical progression (e.g. Chronological, Problem-Solution, General-to-Specific)?

Computational Models of Coherence

Measuring coherence automatically is vital for tasks like automated essay grading and summarization evaluation. Key computational models include:

The Entity Grid Model

Introduced by Barzilay and Lapata (2008), this model represents a document as a grid where rows are sentences and columns are entities. Each cell tracks the grammatical role (Subject, Object, None) of an entity in a sentence.

Sentence | Elon Musk | Tesla  | SpaceX
S1       | Subject   | Object | -
S2       | Subject   | -      | Object
S3       | Subject   | -      | -

A coherent text will show patterns of entity transitions (e.g., Subject → Subject) that are statistically likely in well-written documents.
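The transition statistics can be computed directly from the grid. The sketch below hand-codes the grid from the table above (S = Subject, O = Object, - = absent) and counts adjacent-sentence role transitions; a full implementation would extract entities and grammatical roles with a parser.

```python
from collections import Counter

# Minimal entity-grid sketch (after Barzilay & Lapata, 2008).
# The grid is hand-coded from the example table; a full system would
# derive roles (S = Subject, O = Object, - = absent) from a parser.
grid = {
    "Elon Musk": ["S", "S", "S"],
    "Tesla":     ["O", "-", "-"],
    "SpaceX":    ["-", "O", "-"],
}

def transition_probs(grid):
    """Distribution over adjacent-sentence role transitions, e.g. S->S."""
    counts = Counter()
    for roles in grid.values():
        for a, b in zip(roles, roles[1:]):
            counts[(a, b)] += 1
    total = sum(counts.values())
    return {f"{a}->{b}": n / total for (a, b), n in counts.items()}

print(transition_probs(grid))
# A document whose transition distribution resembles that of well-written
# text (e.g. many S->S continuations) is scored as more coherent.
```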

Centering Theory

Centering Theory (Grosz, Joshi & Weinstein, 1995) tracks the most prominent entity that is "in focus" as the reader moves from sentence to sentence. It predicts that coherence is high when:

  • The "center" (main entity) of a sentence is also the subject of the following sentence.
  • The center changes as little as possible between adjacent sentences.
Sentence 1: "The dog chased the ball." → Center = Dog
Sentence 2: "It ran across the park." → Center = Dog ✅ (High Coherence!)
Sentence 2': "The cat slept on the sofa." → Center shifts to Cat ❌ (Low Coherence)
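The intuition above can be sketched as a toy transition labeler. Here each sentence is hand-annotated with its center (its most salient entity); a real implementation would resolve pronouns and rank entities by grammatical role.

```python
# Toy Centering-style sketch: label each adjacent-sentence pair as a
# CONTINUE (same center) or SHIFT (new center), and use the fraction of
# CONTINUE transitions as a crude coherence proxy. Centers are
# hand-annotated here for illustration.
def transition(prev_center, curr_center):
    return "CONTINUE" if prev_center == curr_center else "SHIFT"

def score(centers):
    """Fraction of CONTINUE transitions across the document."""
    pairs = list(zip(centers, centers[1:]))
    cont = sum(1 for a, b in pairs if a == b)
    return cont / len(pairs)

good = ["dog", "dog"]   # "The dog chased the ball." / "It ran across the park."
bad  = ["dog", "cat"]   # "The dog chased the ball." / "The cat slept on the sofa."
print(score(good), score(bad))  # 1.0 0.0
```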
Neural Coherence Models

  • BERT-based Models: Fine-tuned on synthetic incoherence tasks (e.g., detecting the one shuffled sentence in an otherwise clean paragraph).
  • GPT Perplexity: A language model assigns a probability score to the text; low-perplexity (high-probability) text suggests high coherence.

Coherence Scoring via GPT-2 Perplexity
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast
    
    # Load GPT-2
    model_id = "gpt2"
    model = GPT2LMHeadModel.from_pretrained(model_id)
    tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
    model.eval()
    
    def perplexity(text):
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            loss = model(**inputs, labels=inputs["input_ids"]).loss
        return torch.exp(loss).item()
    
    # Coherent paragraph
    coherent = "The moon is Earth's only natural satellite. It formed 4.5 billion years ago. Scientists believe it resulted from a giant impact."
    
    # Shuffled (incoherent) paragraph
    incoherent = "Scientists believe it resulted from a giant impact. The moon is Earth's only natural satellite. It formed 4.5 billion years ago."
    
    print(f"Coherent Perplexity:   {perplexity(coherent):.2f}  (LOWER = MORE COHERENT)")
    print(f"Incoherent Perplexity: {perplexity(incoherent):.2f}  (HIGHER = LESS COHERENT)")
    
    # Example output (exact values vary by model and library version):
    # Coherent Perplexity:   43.21  (LOWER = MORE COHERENT)
    # Incoherent Perplexity: 89.56  (HIGHER = LESS COHERENT)