Coreference Resolution
Learn how NLP systems identify all expressions in a text that refer to the same real-world entity — a key step in reading comprehension.
Coreference Resolution Overview
Human writing is full of pronouns and abbreviated references. We constantly use words like "he", "she", "it", "they", "the company", and "the researcher" to refer back to entities already introduced in the text. Coreference Resolution is the task of finding all of these expressions that point to the same real-world entity and clustering them together.
Why It Matters
Without coreference resolution, a machine reading the paragraph "Elon Musk founded Tesla. He later started SpaceX. The entrepreneur is now the richest person in the world." would treat "Elon Musk", "He", and "The entrepreneur" as three unrelated expressions instead of one person. Coreference resolution merges them into a single entity cluster.
Key Terminology
Mention
Any noun phrase or pronoun in the text that could refer to an entity. Every "he", "she", "Amazon", "the company" is a candidate mention that needs to be resolved.
Antecedent
The earlier mention that a later expression, most often a pronoun, refers back to. In "John ate his lunch", "John" is the antecedent of "his".
Coreference Chain
A complete cluster of all mentions that refer to the same entity.
Chain #1: {Elon Musk, He, The entrepreneur}.
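The three terms above can be made concrete with a small data-structure sketch. The `Mention` class, the `find_mention` helper, and the `antecedent_of` function are illustrative names invented here, not part of any library, and the "closest preceding mention" rule is a simplification rather than how a real resolver chooses antecedents.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mention:
    """A span of text that may refer to an entity (hypothetical helper type)."""
    text: str
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention


def find_mention(document: str, phrase: str) -> Mention:
    """Locate the first occurrence of phrase and wrap it as a Mention."""
    start = document.index(phrase)
    return Mention(phrase, start, start + len(phrase))


def antecedent_of(mention: Mention, chain: List[Mention]) -> Optional[Mention]:
    """Return the closest earlier mention in the same chain, if any."""
    earlier = [m for m in chain if m.end <= mention.start]
    return earlier[-1] if earlier else None


document = (
    "Elon Musk founded Tesla. He later started SpaceX. "
    "The entrepreneur is now the richest person in the world."
)

# A coreference chain: every mention of one entity, kept in text order.
chain = [
    find_mention(document, phrase)
    for phrase in ("Elon Musk", "He", "The entrepreneur")
]

for m in chain:
    ante = antecedent_of(m, chain)
    print(m.text, "<-", ante.text if ante else None)
```

Storing character offsets alongside the text keeps two occurrences of the same word (for example, two separate "he" pronouns) distinguishable as different mentions.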
A Worked Example
Input Text Analysis
"Amazon announced a new service today. The e-commerce giant said it will create 10,000 jobs. The company will begin hiring next quarter."
Resolved Coreference Chain
Chain #1: {Amazon, The e-commerce giant, it, The company}.
All 4 mentions resolve to the same entity: Amazon.
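One way to see how individual antecedent decisions add up to a chain is the sketch below. The `build_chains` function is a hypothetical helper, and the `links` dictionary is written by hand for this example; it stands in for decisions a resolver would make and is not the output of any model. The sketch assumes every antecedent appears earlier in the list than the mention that points to it.

```python
from typing import Dict, List


def build_chains(mentions: List[str], antecedent_links: Dict[int, int]) -> List[List[str]]:
    """Group mentions into chains by following antecedent links.

    antecedent_links maps a mention's index to the index of its antecedent.
    Mentions with no entry start a new chain of their own.
    """
    chain_of: Dict[int, int] = {}   # mention index -> chain id
    chains: List[List[str]] = []
    for i, text in enumerate(mentions):
        j = antecedent_links.get(i)
        if j is None:
            # No antecedent: this mention introduces a new entity.
            chain_of[i] = len(chains)
            chains.append([text])
        else:
            # Join the chain that the antecedent already belongs to.
            chain_of[i] = chain_of[j]
            chains[chain_of[j]].append(text)
    return chains


# Mentions in text order, taken from the Amazon example.
mentions = [
    "Amazon", "a new service", "The e-commerce giant",
    "it", "10,000 jobs", "The company",
]
# Hand-written links: "The e-commerce giant" -> "Amazon",
# "it" -> "The e-commerce giant", "The company" -> "it".
links = {2: 0, 3: 2, 5: 3}

chains = build_chains(mentions, links)
for chain in chains:
    print(chain)
```

Mentions that never receive a link, such as "a new service", end up as single-element chains, which is how entities mentioned only once are usually represented.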
Modern Approach: Neural Mention-Ranking
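Modern coreference systems are typically neural. An encoder such as SpanBERT produces representations for candidate mention spans, and the model scores pairs of mentions to estimate how likely they are to corefer. For each mention, it ranks the candidate antecedents (the mentions that precede it) and links the mention to the best-scoring one. In practice, candidates are usually pruned so that not every possible pair has to be scored.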
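import spacy
import neuralcoref  # pip install neuralcoref (requires spaCy 2.x; not compatible with spaCy 3+)

# Load a small English pipeline and append the coreference component to it
nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)

text = "Amazon announced a new service. The company said it will create 10,000 jobs."
doc = nlp(text)

# has_coref is True when at least one coreference cluster was found
if doc._.has_coref:
    print("Coreference Clusters Found:")
    for cluster in doc._.coref_clusters:
        # cluster.mentions are spaCy spans; cluster.main is the representative mention
        print(f"  Cluster: {[str(m) for m in cluster.mentions]}")
        print(f"  Main: '{cluster.main}'")

# Expected output (exact clusters can vary with the model version):
# Coreference Clusters Found:
#   Cluster: ['Amazon', 'The company', 'it']
#   Main: 'Amazon'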
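To look at the ranking step in isolation, here is a minimal sketch with no library dependencies. The numbers in `MADE_UP_SCORES` are invented for illustration and stand in for what a trained pairwise scorer would produce; `pick_antecedent` and the fallback score of -5.0 are likewise assumptions of this sketch. It also assumes a "no antecedent" option with a fixed score of 0, a convention used in some mention-ranking models, so a mention is only linked when some candidate scores above zero.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical pairwise scores: (mention, candidate antecedent) -> score.
# Higher means the pair is more likely to corefer. These are NOT model output.
MADE_UP_SCORES: Dict[Tuple[str, str], float] = {
    ("The company", "Amazon"): 2.3,
    ("The company", "a new service"): -1.1,
    ("it", "Amazon"): 1.2,
    ("it", "a new service"): 0.4,
    ("it", "The company"): 1.9,
}


def score(mention: str, candidate: str) -> float:
    """Look up a pair's score; unseen pairs get a low default (assumption)."""
    return MADE_UP_SCORES.get((mention, candidate), -5.0)


def pick_antecedent(
    index: int,
    mentions: List[str],
    scorer: Callable[[str, str], float],
) -> Optional[int]:
    """Rank every earlier mention and return the index of the best one.

    Returns None when no candidate beats the 'no antecedent' score of 0.
    """
    best_index, best_score = None, 0.0
    for j in range(index):
        s = scorer(mentions[index], mentions[j])
        if s > best_score:
            best_index, best_score = j, s
    return best_index


mentions = ["Amazon", "a new service", "The company", "it"]
links = {i: pick_antecedent(i, mentions, score) for i in range(len(mentions))}

for i, j in links.items():
    target = mentions[j] if j is not None else "(no antecedent)"
    print(f"{mentions[i]!r} -> {target}")
```

With these scores, "it" links to "The company" rather than directly to "Amazon". All three still land in the same chain, because each mention only needs one correct link into the cluster.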