Information Extraction
Extract structured facts from raw text.
Information Extraction (IE)
Information Extraction (IE) is the automated process of turning unstructured natural language text into structured data such as entities, relationships, and attributes. It is a core step in building knowledge graphs and populating databases from text.
Core Subtasks of IE
- Named Entity Recognition (NER): Identifying entities like Person, Org, Date.
- Relation Extraction (RE): Identifying semantic relationships between entities (e.g., "Elon Musk" CEO_OF "SpaceX").
- Event Extraction: Identifying events together with their participants, times, and locations (e.g., a "Company Acquisition" event).
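To make the three subtasks concrete, the sketch below shows one possible way to represent their outputs as plain Python data structures. The class names (`Entity`, `Relation`, `Event`) and their fields are illustrative assumptions, not part of any particular IE library, and the values are written by hand rather than produced by a model.

```python
from dataclasses import dataclass, field

# Hypothetical containers for IE output; names and fields are illustrative.

@dataclass(frozen=True)
class Entity:
    text: str    # surface form as it appears in the text
    label: str   # entity type, e.g. "Person", "Org", "Date"

@dataclass(frozen=True)
class Relation:
    head: Entity
    predicate: str   # relation type, e.g. "CEO_OF"
    tail: Entity

@dataclass
class Event:
    event_type: str                                   # e.g. "Company Acquisition"
    participants: dict = field(default_factory=dict)  # role -> Entity

# NER output: typed spans
musk = Entity("Elon Musk", "Person")
spacex = Entity("SpaceX", "Org")

# RE output: a typed link between two entities
ceo_of = Relation(musk, "CEO_OF", spacex)

# Event extraction output: a typed event with role-labelled arguments
acquisition = Event(
    "Company Acquisition",
    participants={
        "acquirer": Entity("Microsoft", "Org"),
        "acquired": Entity("LinkedIn", "Org"),
    },
)

print((ceo_of.head.text, ceo_of.predicate, ceo_of.tail.text))
print(acquisition.event_type, sorted(acquisition.participants))
```

Each subtask builds on the previous one: relations connect entities, and events group entities under named roles.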
Level 1 — Rule-Based Extraction (Regex)
For fields with a predictable surface format, such as dates, email addresses, or phone numbers, regular expressions are usually the fastest option and are easy to audit, though a pattern only matches the formats it was written for.
Regex Email Extraction
import re

text = ("Please contact our support team at help@company.com or "
        "reach out to our CEO directly at ceo.name@enterprise.org.")

# Basic email pattern: local part, "@", then dot-separated domain labels.
# Every dot must be followed by a label, which keeps the sentence-final
# period out of the match.
email_pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+'

emails_found = re.findall(email_pattern, text)
print("Extracted Emails:", emails_found)
# Output: Extracted Emails: ['help@company.com', 'ceo.name@enterprise.org']
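The same approach applies to the other structured fields mentioned above. As a minimal sketch, the example below pulls ISO-style dates (`YYYY-MM-DD`) out of text with named groups and discards matches that are not real calendar dates. The sample text, the `extract_dates` helper name, and the choice of date format are assumptions made for illustration.

```python
import re
from datetime import date

# Hypothetical helper: extract YYYY-MM-DD dates and keep only valid ones.
DATE_PATTERN = re.compile(r'\b(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\b')

def extract_dates(text):
    found = []
    for match in DATE_PATTERN.finditer(text):
        try:
            found.append(date(int(match['year']), int(match['month']), int(match['day'])))
        except ValueError:
            # Matches the pattern but is not a real date (e.g. month 13)
            continue
    return found

sample = "Invoice issued 2024-01-15, due 2024-02-14; the entry 2024-13-40 is a typo."
print(extract_dates(sample))
```

Validating after matching keeps the pattern simple: the regex only checks the shape of the string, and `datetime.date` rejects impossible values.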
Level 2 — Relation Extraction using Dependency Parsing
By analyzing the grammatical structure of a sentence (its dependency parse), we can extract "Subject-Verb-Object" triples that serve as basic relation data.
Subject-Verb-Object Extraction using spaCy
import spacy

# Install the model first: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

text = "Microsoft acquired LinkedIn for $26 billion."
doc = nlp(text)

relations = []
for token in doc:
    # Treat every verb as a candidate relation
    if token.pos_ == "VERB":
        # Subjects are looked up to the left of the verb, objects to the right
        subject = [w for w in token.lefts if w.dep_ == 'nsubj']
        direct_object = [w for w in token.rights if w.dep_ in ['dobj', 'pobj']]
        if subject and direct_object:
            rel = (subject[0].text, token.lemma_, direct_object[0].text)
            relations.append(rel)

print("Extracted Relations:", relations)
# Expected output (may vary with the model version):
# Extracted Relations: [('Microsoft', 'acquire', 'LinkedIn')]
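To show the traversal without depending on a trained model, here is a minimal sketch that runs the same subject-verb-object logic over a hand-annotated dependency parse. The `Tok` tuple, the `children` and `extract_svo` helpers, and the parse itself are assumptions written out by hand for illustration; a real parser may label this sentence differently, and the money expression is kept as a single token for brevity. The sketch also follows `prep` to `pobj` links, since in spaCy-style English parses a prepositional object hangs off the preposition rather than directly off the verb.

```python
from collections import namedtuple

# Hypothetical hand-annotated parse: index, text, lemma, POS, dependency label, head index
Tok = namedtuple("Tok", "i text lemma pos dep head")

parse = [
    Tok(0, "Microsoft", "Microsoft", "PROPN", "nsubj", 1),
    Tok(1, "acquired", "acquire", "VERB", "ROOT", 1),   # the root points at itself
    Tok(2, "LinkedIn", "LinkedIn", "PROPN", "dobj", 1),
    Tok(3, "for", "for", "ADP", "prep", 1),
    Tok(4, "$26 billion", "$26 billion", "NUM", "pobj", 3),
    Tok(5, ".", ".", "PUNCT", "punct", 1),
]

def children(tokens, head, dep):
    """Return tokens attached to `head` with dependency label `dep`."""
    return [t for t in tokens if t.head == head.i and t.dep == dep and t.i != head.i]

def extract_svo(tokens):
    triples = []
    for verb in (t for t in tokens if t.pos == "VERB"):
        subjects = children(tokens, verb, "nsubj")
        if not subjects:
            continue
        subj = subjects[0].text
        # Direct objects attach to the verb itself
        for obj in children(tokens, verb, "dobj"):
            triples.append((subj, verb.lemma, obj.text))
        # Prepositional objects attach to the preposition: verb -> prep -> pobj
        for prep in children(tokens, verb, "prep"):
            for pobj in children(tokens, prep, "pobj"):
                triples.append((subj, f"{verb.lemma}_{prep.text}", pobj.text))
    return triples

print(extract_svo(parse))
# [('Microsoft', 'acquire', 'LinkedIn'), ('Microsoft', 'acquire_for', '$26 billion')]
```

Folding the preposition into the relation name (`acquire_for`) is one simple design choice for keeping the price distinct from the acquired company.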
Level 3 — Open Information Extraction (OpenIE)
OpenIE systems automatically discover any relationships mentioned in text without requiring a predefined ontology of relation types.
Stanford OpenIE via Stanza's CoreNLP client
from stanza.server import CoreNLPClient

# OpenIE is a CoreNLP annotator, so this needs Java and a local CoreNLP
# install (e.g. via stanza.install_corenlp()) with CORENLP_HOME set.
text = "Barack Obama was born in Hawaii and later won the Nobel Peace Prize."
annotators = ['tokenize', 'ssplit', 'pos', 'lemma', 'depparse', 'natlog', 'openie']

with CoreNLPClient(annotators=annotators, be_quiet=True) as client:
    ann = client.annotate(text)

    print("OpenIE Triples:")
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            print(f"Subject: {triple.subject}")
            print(f"Relation: {triple.relation}")
            print(f"Object: {triple.object}\n")

# Example extracted facts (the full output may include additional, overlapping triples):
# (Barack Obama, was born in, Hawaii)
# (Barack Obama, won, the Nobel Peace Prize)
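Triples like these are the raw material for the knowledge graphs mentioned at the start of this tutorial. As a minimal sketch, the code below takes a hand-written list of triples, drops near-duplicates, and indexes the rest by subject. The `build_graph` helper and its normalization rule (collapse whitespace, ignore case) are assumptions for illustration, not part of any OpenIE toolkit.

```python
from collections import defaultdict

def build_graph(triples):
    """Index (subject, relation, object) triples by subject, skipping duplicates."""
    graph = defaultdict(list)
    seen = set()
    for subj, rel, obj in triples:
        # Normalize whitespace and case so trivially different copies collapse
        key = tuple(" ".join(part.split()).lower() for part in (subj, rel, obj))
        if key in seen:
            continue
        seen.add(key)
        graph[subj.strip()].append((rel.strip(), obj.strip()))
    return dict(graph)

# Hand-written input; the third triple is a near-duplicate of the first
triples = [
    ("Barack Obama", "was born in", "Hawaii"),
    ("Barack Obama", "won", "the Nobel Peace Prize"),
    ("Barack Obama", "was born in", "hawaii "),
]

graph = build_graph(triples)
for subject, edges in graph.items():
    for relation, obj in edges:
        print(f"{subject} --{relation}--> {obj}")
```

Deduplicating before indexing matters for OpenIE output in particular, because these systems tend to emit several overlapping variants of the same fact.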