Information Extraction
Named entity recognition and structured information extraction from text.
Named Entity Recognition
Extracting "Real World" Entities
Named Entity Recognition (NER) is the task of identifying and categorizing key information (entities) in text into predefined categories such as names, organizations, locations, and dates.
How a Machine Sees Text:
"Elon Musk [PERSON], the CEO of SpaceX [ORG], landed in Texas [GPE] on January 10th [DATE]."
Level 1 — The BIO Tagging Scheme
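To mark multi-word entities (like "New York"), NER systems tag each token with the Inside-Outside-Beginning (IOB, also written BIO) scheme: B- marks the first token of an entity, I- marks a continuation of it, and O marks tokens outside any entity.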
| Word | Label | Description |
|---|---|---|
| New | B-LOC | Beginning of Location |
| York | I-LOC | Inside Location |
| is | O | Outside any entity |
| cold | O | Outside any entity |
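Reading the table back into entities is mostly a matter of grouping each B- tag with the I- tags that follow it. The sketch below is one minimal way to do that in plain Python; the function name `bio_to_spans` is made up for this illustration, and it assumes tokens are joined with single spaces and that a stray I- tag without a preceding B- tag is simply ignored.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity in progress first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # Continuation of the entity in progress.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) ends the entity in progress.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["New", "York", "is", "cold"]
tags = ["B-LOC", "I-LOC", "O", "O"]
print(bio_to_spans(tokens, tags))  # [('New York', 'LOC')]
```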
Level 2 — Standard NER with spaCy
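spaCy provides pre-trained English pipelines that recognize 18 entity types out of the box (such as PERSON, ORG, GPE, DATE, and MONEY). Accuracy depends on the model size and on how closely your text resembles the data the model was trained on.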
import spacy
# Load the small English model
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying a U.K. startup for $1 billion in June 2024."
doc = nlp(text)
print(f"{'Text':<15} | {'Label':<10} | {'Description'}")
print("-" * 45)
for ent in doc.ents:
    print(f"{ent.text:<15} | {ent.label_:<10} | {spacy.explain(ent.label_)}")
Level 3 — Custom NER Training
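import spacy
from spacy.tokens import DocBin

# Example training data: (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("Search for iPhone 15 Pro", {"entities": [(11, 24, "PRODUCT")]}),
    ("Order a Samsung Galaxy S23", {"entities": [(8, 26, "PRODUCT")]}),
]

# In modern spaCy (v3+), you typically use the 'spacy train' command
# with a config file and .spacy data files created via DocBin.
nlp = spacy.blank("en")  # tokenizer only; no trained components needed here
db = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(start, end, label=label) for start, end, label in annotations["entities"]]
    doc.ents = [span for span in spans if span is not None]  # None = offsets not on token boundaries
    db.add(doc)
db.to_disk("train.spacy")  # output file name is an arbitrary choice for this example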
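The (start, end, label) tuples in this training format are character offsets into the text, with an exclusive end. Off-by-one mistakes are easy to make, so a quick sanity check in plain Python can help before any training run. The helper name `entity_texts` below is hypothetical and the check is only a sketch: it confirms that each span is in range and shows the substring it selects.

```python
TRAIN_DATA = [
    ("Search for iPhone 15 Pro", {"entities": [(11, 24, "PRODUCT")]}),
    ("Order a Samsung Galaxy S23", {"entities": [(8, 26, "PRODUCT")]}),
]


def entity_texts(text, annotations):
    """Return (substring, label) for each (start, end, label) character span."""
    results = []
    for start, end, label in annotations["entities"]:
        # Offsets are 0-based and the end index is exclusive, like Python slices.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"Span ({start}, {end}) is out of range for {text!r}")
        results.append((text[start:end], label))
    return results


for text, annotations in TRAIN_DATA:
    print(entity_texts(text, annotations))
# [('iPhone 15 Pro', 'PRODUCT')]
# [('Samsung Galaxy S23', 'PRODUCT')]
```

If a printed substring starts or ends in the middle of a word, the offsets need adjusting before the example is used for training.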
Information Extraction
Information Extraction (IE)
Information Extraction refers to the automated process of retrieving structured data (e.g., entities, relationships, attributes) from unstructured natural language text. It is crucial for building knowledge graphs and databases from text.
Core Subtasks of IE
- Named Entity Recognition (NER): Identifying entities like Person, Org, Date.
- Relation Extraction (RE): Identifying semantic relationships between entities (e.g., "Elon Musk" CEO_OF "SpaceX").
- Event Extraction: Identifying events along with their participants, times, and locations (e.g., a "Company Acquisition" event).
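The three subtasks above each produce a different kind of structured record. One way to picture the output, purely as an illustrative sketch, is with small Python dataclasses. The class and field names here (`Entity`, `Relation`, `Event`, `obj`, and so on) are assumptions made for this example and do not come from any particular library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Entity:
    text: str
    label: str  # e.g. PERSON, ORG, DATE


@dataclass(frozen=True)
class Relation:
    subject: Entity
    predicate: str  # e.g. CEO_OF
    obj: Entity


@dataclass
class Event:
    event_type: str
    participants: List[Entity] = field(default_factory=list)
    time: str = ""
    location: str = ""


# NER output: typed entities.
musk = Entity("Elon Musk", "PERSON")
spacex = Entity("SpaceX", "ORG")

# Relation extraction output: a typed link between two entities.
relation = Relation(musk, "CEO_OF", spacex)

# Event extraction output: an event type plus its participants.
event = Event(
    "Company Acquisition",
    participants=[Entity("Microsoft", "ORG"), Entity("LinkedIn", "ORG")],
)

# A knowledge graph can be approximated as subject -> [(predicate, object)].
graph: Dict[str, List[Tuple[str, str]]] = {}
graph.setdefault(relation.subject.text, []).append(
    (relation.predicate, relation.obj.text)
)
print(graph)  # {'Elon Musk': [('CEO_OF', 'SpaceX')]}
```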
Level 1 — Rule-Based Extraction (Regex)
For highly structured fields like dates, emails, or phone numbers, regular expressions are often the fastest and most predictable extraction tool, although each pattern only covers the formats it was written for.
import re
text = ("Please contact our support team at help@company.com or "
        "reach out to our CEO directly at ceo.name@enterprise.org.")
# Regex pattern for basic email addresses; every domain label must follow a dot,
# so the sentence-final period is not captured as part of the address
email_pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+'
emails_found = re.findall(email_pattern, text)
print("Extracted Emails:", emails_found)
# Output: Extracted Emails: ['help@company.com', 'ceo.name@enterprise.org']
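The same approach carries over to the other structured fields mentioned above, such as dates. The sketch below assumes ISO-style dates (YYYY-MM-DD) only. The sample sentence is invented for illustration, and the pattern checks the shape of the date, not whether the day exists in that month.

```python
import re

text = "The invoice was issued on 2024-06-15 and is due by 2024-07-01."

# ISO-style dates only (YYYY-MM-DD); named groups make the fields explicit.
date_pattern = re.compile(
    r"\b(?P<year>\d{4})-(?P<month>0[1-9]|1[0-2])-(?P<day>0[1-9]|[12]\d|3[01])\b"
)

dates_found = [match.groupdict() for match in date_pattern.finditer(text)]
print(dates_found)
# [{'year': '2024', 'month': '06', 'day': '15'}, {'year': '2024', 'month': '07', 'day': '01'}]
```

Named groups return each match as a small record rather than a raw string, which is usually closer to what a downstream database or knowledge graph expects.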
Level 2 — Relation Extraction using Dependency Parsing
By analyzing the grammatical structure of a sentence (syntax tree), we can extract "Subject-Verb-Object" triples to form basic relation data.
import spacy
# python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
text = "Microsoft acquired LinkedIn for $26 billion."
doc = nlp(text)
relations = []
for token in doc:
    # Treat every verb as a candidate relation
    if token.pos_ == "VERB":
        subject = [w for w in token.lefts if w.dep_ == 'nsubj']
        # Prepositional objects ('pobj') attach to the preposition, not the verb,
        # so only direct objects are collected here
        direct_object = [w for w in token.rights if w.dep_ == 'dobj']
        if subject and direct_object:
            rel = (subject[0].text, token.lemma_, direct_object[0].text)
            relations.append(rel)
print("Extracted Relations:", relations)
# Expected output (parses can vary by model version): [('Microsoft', 'acquire', 'LinkedIn')]
Level 3 — Open Information Extraction (OpenIE)
OpenIE systems automatically discover any relationships mentioned in text without requiring a predefined ontology of relation types.
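from stanza.server import CoreNLPClient

# OpenIE is an annotator of the Java-based Stanford CoreNLP toolkit, not a processor
# in Stanza's neural Pipeline, so this example goes through Stanza's CoreNLP client.
# Requires Java and a local CoreNLP installation (e.g. via stanza.install_corenlp(),
# with the CORENLP_HOME environment variable pointing at it).
text = "Barack Obama was born in Hawaii and later won the Nobel Peace Prize."

annotators = ['tokenize', 'ssplit', 'pos', 'lemma', 'depparse', 'natlog', 'openie']
with CoreNLPClient(annotators=annotators, be_quiet=True) as client:
    ann = client.annotate(text)

print("OpenIE Triples:")
for sentence in ann.sentence:
    for triple in sentence.openieTriple:
        print(f"Subject: {triple.subject}")
        print(f"Relation: {triple.relation}")
        print(f"Object: {triple.object}\n")

# Example of the kind of facts extracted (actual output usually contains more, overlapping triples):
# (Barack Obama, was born in, Hawaii)
# (Barack Obama, won, the Nobel Peace Prize)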