Named Entity Recognition

Master Named Entity Recognition from BIO tagging to custom spaCy models.

Extracting "Real World" Entities

Named Entity Recognition (NER) is the task of locating key pieces of information (entities) in text and classifying them into predefined categories such as people, organizations, locations, and dates.

How a Machine Sees Text:

"Elon Musk [PERSON], the CEO of SpaceX [ORG], landed in Texas [GPE] on January 10th [DATE]."

Level 1 — The BIO Tagging Scheme

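To mark multi-word entities (like "New York City"), each token is tagged using the Inside-Outside-Beginning (IOB, also written BIO) scheme: B- marks the first token of an entity, I- marks a token that continues it, and O marks a token outside any entity.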

Word    Label    Description
New     B-LOC    Beginning of Location
York    I-LOC    Inside Location
is      O        Outside any entity
cold    O        Outside any entity
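
To make the scheme concrete, the tags in the table can be decoded back into entities with a few lines of plain Python. This is a minimal sketch rather than a library routine: `bio_to_spans` is a hypothetical helper, and it assumes the tokens and tags are already aligned one to one.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    entities = []
    current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) closes it.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


tokens = ["New", "York", "is", "cold"]
tags = ["B-LOC", "I-LOC", "O", "O"]
print(bio_to_spans(tokens, tags))  # [('New York', 'LOC')]
```

The B- prefix is what lets two adjacent entities of the same type stay separate: the tags B-LOC, B-LOC decode to two locations, whereas B-LOC, I-LOC decode to one. This sketch silently drops an I- tag that has no matching B- before it; other decoders choose to treat it as the start of a new entity.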

Level 2 — Standard NER with spaCy

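spaCy ships pre-trained pipelines that work out of the box. The English pipelines recognize 18 entity types (PERSON, ORG, GPE, DATE, MONEY, and others); accuracy depends on the model size and on how closely your text resembles the data the model was trained on.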

Python: spaCy NER
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

text = "Apple is looking at buying a U.K. startup for $1 billion in June 2024."

doc = nlp(text)

print(f"{'Text':<15} | {'Label':<10} | {'Description'}")
print("-" * 45)

for ent in doc.ents:
    print(f"{ent.text:<15} | {ent.label_:<10} | {spacy.explain(ent.label_)}")

Level 3 — Custom NER Training

Note: Use custom training when standard models don't recognize your industry-specific terms (e.g., medical codes, legal clauses).

Python: Training Format
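import spacy
from spacy.tokens import DocBin

# Training data format: (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("Search for iPhone 15 Pro", {"entities": [(11, 24, "PRODUCT")]}),
    ("Order a Samsung Galaxy S23", {"entities": [(8, 26, "PRODUCT")]}),
]

# In modern spaCy (v3+), you typically run the 'spacy train' command
# with a config file and .spacy data files created via DocBin.
nlp = spacy.blank("en")  # blank pipeline: tokenizer only, no download needed
doc_bin = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    # char_span returns None when offsets do not align with token boundaries
    spans = [doc.char_span(s, e, label=label) for s, e, label in annotations["entities"]]
    doc.ents = [span for span in spans if span is not None]
    doc_bin.add(doc)
doc_bin.to_disk("train.spacy")  # example output path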
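
The character offsets in the training format are easy to get wrong by one, so it can help to check them before training. The sketch below converts offsets into the BIO tags from Level 1 using plain Python. It assumes whitespace tokenization, which is a simplification (spaCy uses its own tokenizer), and `offsets_to_bio` is a hypothetical helper, not a spaCy function.

```python
def offsets_to_bio(text, entities):
    """Whitespace-tokenize text and assign BIO tags from character offsets.

    entities is a list of (start_char, end_char, label) with exclusive ends.
    """
    tagged = []
    cursor = 0
    for token in text.split():
        # Locate this token in the original string to get its offsets.
        start = text.index(token, cursor)
        end = start + len(token)
        cursor = end

        tag = "O"
        for ent_start, ent_end, label in entities:
            # The token must lie fully inside the entity span.
            if ent_start <= start and end <= ent_end:
                prefix = "B-" if start == ent_start else "I-"
                tag = prefix + label
                break
        tagged.append((token, tag))
    return tagged


example_text = "Search for iPhone 15 Pro"
example_entities = [(11, 24, "PRODUCT")]
for token, tag in offsets_to_bio(example_text, example_entities):
    print(f"{token:<10} {tag}")
```

If an offset is off by one, the affected token falls partly outside the span and is tagged O, which makes the mistake easy to spot in the printed output.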