NLP Tutorial

Information Extraction

Named entity recognition and structured information extraction from text.

Named Entity Recognition

Extracting "Real World" Entities

Named Entity Recognition (NER) is the task of identifying and categorizing key information (entities) in text into predefined categories such as names, organizations, locations, and dates.

How a Machine Sees Text:

"[Elon Musk | PERSON], the CEO of [SpaceX | ORG], landed in [Texas | GPE] on [January 10th | DATE]."

Level 1 — The BIO Tagging Scheme

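To mark multi-word entities (like "New York"), sequence taggers label every token using the Inside-Outside-Beginning (IOB, also written BIO) scheme: a B- tag opens an entity, an I- tag continues it, and O marks tokens outside any entity.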

Word	Label	Description
New	B-LOC	Beginning of Location
York	I-LOC	Inside Location
is	O	Outside any entity
cold	O	Outside any entity
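
The table above can be turned back into entity spans with a few lines of code. The following is a minimal sketch, not a library API: the function name bio_to_spans is hypothetical, and it assumes tokens and tags arrive as two parallel lists. Stray I- tags that do not continue an open entity are simply dropped.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens: List[str] = []
    current_type = None

    def flush():
        # Close the entity currently being built, if there is one.
        nonlocal current_tokens, current_type
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
    flush()
    return spans


tokens = ["New", "York", "is", "cold"]
tags = ["B-LOC", "I-LOC", "O", "O"]
print(bio_to_spans(tokens, tags))
```

Joining tokens with a single space is a simplification; real pipelines usually keep character offsets so the original text can be recovered exactly.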

Level 2 — Standard NER with spaCy

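spaCy provides pre-trained pipelines whose English models recognize 18 entity types (such as PERSON, ORG, GPE, DATE, and MONEY). The small model used below favors speed and size; the larger English models are generally more accurate.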

Python: spaCy NER

import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

text = "Apple is looking at buying a U.K. startup for $1 billion in June 2024."

doc = nlp(text)

print(f"{'Text':<15} | {'Label':<10} | {'Description'}")
print("-" * 45)

for ent in doc.ents:
    print(f"{ent.text:<15} | {ent.label_:<10} | {spacy.explain(ent.label_)}")

Level 3 — Custom NER Training

Note: Use custom training when standard models don't recognize your domain-specific terms (e.g., medical codes, legal clauses).

Python: Training Format

import spacy
from spacy.tokens import DocBin

# Example training data format
TRAIN_DATA = [
    ("Search for iPhone 15 Pro", {"entities": [(11, 24, "PRODUCT")]}),
    ("Order a Samsung Galaxy S23", {"entities": [(8, 26, "PRODUCT")]})
]

# In modern spaCy (v3+), you typically use 'spacy train' command 
# with a config file and .spacy data files created via DocBin.
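
Character offsets in this format are easy to get wrong by one, so it can help to check them before training. The sketch below is a plain-Python sanity check, not part of spaCy: check_offsets and BAD_EXAMPLE are hypothetical names, and the "mid-word" test assumes entities are bounded by whitespace, which is only a rough stand-in for spaCy's real tokenizer.

```python
def check_offsets(train_data):
    """Return (text, entity, reason) tuples for suspicious annotations."""
    problems = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            entity = (start, end, label)
            if not 0 <= start < end <= len(text):
                problems.append((text, entity, "offsets out of range"))
                continue
            span = text[start:end]
            if span != span.strip():
                problems.append((text, entity, "span has surrounding whitespace"))
            elif (start > 0 and not text[start - 1].isspace()) or (
                end < len(text) and not text[end].isspace()
            ):
                # Heuristic: the span cuts a whitespace-delimited word in two.
                problems.append((text, entity, "span starts or ends mid-word"))
    return problems


TRAIN_DATA = [
    ("Search for iPhone 15 Pro", {"entities": [(11, 24, "PRODUCT")]}),
    ("Order a Samsung Galaxy S23", {"entities": [(8, 26, "PRODUCT")]}),
]

# Deliberately wrong end offset (25 instead of 26), for illustration only.
BAD_EXAMPLE = [
    ("Order a Samsung Galaxy S23", {"entities": [(8, 25, "PRODUCT")]}),
]

print(check_offsets(TRAIN_DATA))
for text, entity, reason in check_offsets(BAD_EXAMPLE):
    print(f"{text!r} {entity}: {reason}")
```

Spans that do not line up with token boundaries are a common reason annotations fail to convert cleanly into training data, so a check like this is cheap insurance.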

Information Extraction

Information Extraction (IE)

Information Extraction refers to the automated process of retrieving structured data (e.g., entities, relationships, attributes) from unstructured natural language text. It is crucial for building knowledge graphs and databases from text.

Core Subtasks of IE

Named Entity Recognition (NER): Identifying entities like Person, Org, Date.
Relation Extraction (RE): Identifying semantic relationships between entities (e.g., "Elon Musk" CEO_OF "SpaceX").
Event Extraction: Identifying events, participants, times, and locations (e.g., a "Company Acquisition" event).
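
To make "structured data" concrete, here is a minimal sketch of how extracted relations might be stored and indexed as a tiny knowledge graph. The Relation type, the build_graph helper, and the relation label "ACQUIRED" are illustrative assumptions, not a standard schema.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Relation(NamedTuple):
    """One extracted fact: subject --predicate--> obj."""
    subject: str
    predicate: str
    obj: str


def build_graph(relations: List[Relation]) -> Dict[str, List[Tuple[str, str]]]:
    """Index relations by subject: subject -> [(predicate, object), ...]."""
    graph = defaultdict(list)
    for rel in relations:
        graph[rel.subject].append((rel.predicate, rel.obj))
    return dict(graph)


relations = [
    Relation("Elon Musk", "CEO_OF", "SpaceX"),
    Relation("Microsoft", "ACQUIRED", "LinkedIn"),
]

graph = build_graph(relations)
print(graph["Elon Musk"])
```

A plain dictionary is enough for illustration; a production system would more likely load such triples into a graph database or a relational table.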

Level 1 — Rule-Based Extraction (Regex)

For highly structured fields like dates, emails, or phone numbers, regular expressions are often the fastest and most predictable extraction tool, although a simple pattern only covers the formats it was written for.

Regex Email Extraction

import re

text = ("Please contact our support team at help@company.com or "
        "reach out to our CEO directly at ceo.name@enterprise.org.")

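# Regex pattern for basic email addresses. Every domain label must follow a dot,
# so sentence-ending punctuation is not captured as part of the address.
email_pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+'

emails_found = re.findall(email_pattern, text)
print("Extracted Emails:", emails_found)
# Output: Extracted Emails: ['help@company.com', 'ceo.name@enterprise.org']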
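
The same approach extends to the other structured fields mentioned above. The sketch below covers only two narrow formats, ISO-style dates (YYYY-MM-DD) and phone numbers written as NNN-NNN-NNNN; the sample text and both patterns are illustrative assumptions, and real-world data usually needs broader patterns or a dedicated parsing library.

```python
import re

# Made-up sample text for illustration.
text = ("Invoice issued on 2024-06-15; payment is due by 2024-07-01. "
        "Call 555-123-4567 with questions.")

# YYYY-MM-DD with month 01-12 and day 01-31.
# This checks digit ranges only, not calendar validity (2024-02-31 would match).
date_pattern = r'\b\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])\b'

# Exactly three digits, three digits, four digits, separated by hyphens.
phone_pattern = r'\b\d{3}-\d{3}-\d{4}\b'

dates = re.findall(date_pattern, text)
phones = re.findall(phone_pattern, text)

print("Dates:", dates)
print("Phones:", phones)
```

The word-boundary anchors (\b) keep the date pattern from matching inside longer digit runs, which is the kind of detail that makes regex extraction reliable only for formats you control.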

Level 2 — Relation Extraction using Dependency Parsing

By analyzing the grammatical structure of a sentence (syntax tree), we can extract "Subject-Verb-Object" triples to form basic relation data.

Subject-Verb-Object Extraction using spaCy

import spacy
# python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

text = "Microsoft acquired LinkedIn for $26 billion."
doc = nlp(text)

relations = []
for token in doc:
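    # Check every verb (not only the root) for a subject and an object
    if token.pos_ == "VERB":
        subject = [w for w in token.lefts if w.dep_ == 'nsubj']
        # 'dobj' is the direct object; 'pobj' normally hangs off a preposition,
        # so it rarely matches among the verb's own children
        direct_object = [w for w in token.rights if w.dep_ in ['dobj', 'pobj']]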
        
        if subject and direct_object:
            rel = (subject[0].text, token.lemma_, direct_object[0].text)
            relations.append(rel)

print("Extracted Relations:", relations)
# Output: [('Microsoft', 'acquire', 'LinkedIn')]

Level 3 — Open Information Extraction (OpenIE)

OpenIE systems automatically discover any relationships mentioned in text without requiring a predefined ontology of relation types.

Stanford OpenIE via Stanza

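from stanza.server import CoreNLPClient

# OpenIE is a Stanford CoreNLP (Java) annotator, not a native Stanza processor,
# so it is reached through Stanza's CoreNLP client rather than stanza.Pipeline.
# Requires Java and a CoreNLP install (e.g. via stanza.install_corenlp()),
# with the CORENLP_HOME environment variable pointing at it.
text = "Barack Obama was born in Hawaii and later won the Nobel Peace Prize."

annotators = ['tokenize', 'ssplit', 'pos', 'lemma', 'depparse', 'natlog', 'openie']

print("OpenIE Triples:")
with CoreNLPClient(annotators=annotators, be_quiet=True) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            print(f"Subject: {triple.subject}")
            print(f"Relation: {triple.relation}")
            print(f"Object: {triple.object}\n")

# Example Extracted Facts (actual output may include additional, overlapping triples):
# (Barack Obama, was born in, Hawaii)
# (Barack Obama, won, the Nobel Peace Prize)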