Named Entity Recognition
Master Named Entity Recognition from BIO tagging to custom spaCy models.
Extracting "Real World" Entities
Named Entity Recognition (NER) is the task of locating spans of text that refer to real-world entities and classifying each span into a predefined category, such as person, organization, location, or date.
How a Machine Sees Text:
"Elon Musk [PERSON], the CEO of SpaceX [ORG], landed in Texas [GPE] on January 10th [DATE]."
Level 1 — The BIO Tagging Scheme
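To mark multi-word entities (like "New York City"), NER systems label every token using the Inside-Outside-Beginning scheme (IOB, also written BIO): a B- tag marks the first token of an entity, an I- tag marks each following token of the same entity, and O marks tokens outside any entity. The table tags the sentence "New York is cold".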
| Word | Label | Description |
|---|---|---|
| New | B-LOC | Beginning of Location |
| York | I-LOC | Inside Location |
| is | O | Outside any entity |
| cold | O | Outside any entity |
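To make the scheme concrete, here is a minimal sketch of decoding BIO tags back into entities. The helper name `bio_to_spans` is made up for this illustration, and it assumes tokens and tags arrive as two parallel lists.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    spans = []
    current_tokens = []
    current_label = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O" closes the open entity; a stray I- tag with no matching
            # B- is treated the same way and its token is ignored.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens = []
            current_label = None

    # Close an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["New", "York", "is", "cold"]
tags = ["B-LOC", "I-LOC", "O", "O"]
print(bio_to_spans(tokens, tags))  # [('New York', 'LOC')]
```

The B- tag is what lets two adjacent entities of the same type stay separate: without it, "Paris London" tagged only as inside/outside would merge into one span.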
Level 2 — Standard NER with spaCy
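spaCy provides pre-trained English pipelines whose NER component recognizes 18 entity types (the OntoNotes 5 label set, including PERSON, ORG, GPE, DATE, and MONEY). The small model used here favors speed and size; larger pipelines such as en_core_web_trf are generally more accurate.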
Python: spaCy NER
import spacy
# Load the small English model
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying a U.K. startup for $1 billion in June 2024."
doc = nlp(text)
print(f"{'Text':<15} | {'Label':<10} | {'Description'}")
print("-" * 45)
for ent in doc.ents:
    print(f"{ent.text:<15} | {ent.label_:<10} | {spacy.explain(ent.label_)}")
Level 3 — Custom NER Training
Note: Use custom training when standard models don't recognize your industry-specific terms (e.g., medical codes, legal clauses).
Python: Training Format
import spacy
from spacy.tokens import DocBin
# Example training data format
# Each entity is (start_char, end_char, label); end_char is exclusive
TRAIN_DATA = [
    ("Search for iPhone 15 Pro", {"entities": [(11, 24, "PRODUCT")]}),
    ("Order a Samsung Galaxy S23", {"entities": [(8, 26, "PRODUCT")]}),
]
# In modern spaCy (v3+), you typically use 'spacy train' command
# with a config file and .spacy data files created via DocBin.
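As a rough sketch of that conversion step, the snippet below turns offset-annotated examples into a DocBin. The helper name `build_docbin` and the sample data are illustrative assumptions. It uses a blank English pipeline only for tokenization, so no trained model needs to be downloaded.

```python
import spacy
from spacy.tokens import DocBin

# Illustrative examples in the (text, {"entities": [(start, end, label)]}) format.
TRAIN_DATA = [
    ("Search for iPhone 15 Pro", {"entities": [(11, 24, "PRODUCT")]}),
    ("Order a Samsung Galaxy S23", {"entities": [(8, 26, "PRODUCT")]}),
]


def build_docbin(train_data, nlp):
    """Convert offset-annotated examples into a DocBin of annotated Docs."""
    db = DocBin()
    for text, annotations in train_data:
        doc = nlp.make_doc(text)  # tokenize only; no trained components run
        spans = []
        for start, end, label in annotations["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is None:
                # char_span returns None when the offsets do not line up
                # with token boundaries, so skip and report the entity.
                print(f"Skipping misaligned entity {text[start:end]!r} in {text!r}")
                continue
            spans.append(span)
        doc.ents = spans
        db.add(doc)
    return db


nlp = spacy.blank("en")  # blank pipeline, used here only for its tokenizer
db = build_docbin(TRAIN_DATA, nlp)
print(len(db))  # number of Docs stored in the DocBin
```

From here, db.to_disk("train.spacy") writes the binary file, and training would typically be started with something like: python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy. The file names are placeholders, and a separate dev set has to be built the same way.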