spaCy Tutorial

spaCy

Industrial-strength natural language processing optimized for production speed and efficiency.

spaCy: Industrial-Strength NLP

Unlike NLTK, which was built primarily for teaching and research, spaCy was built for production software. Its core is written in Cython, giving it tight memory management and high speed, and it provides a single, highly optimized implementation for each task rather than a menu of algorithm choices.

spaCy excels at large-scale information extraction, shipping pre-trained statistical and neural network pipelines for more than 20 languages.

Level 1 — The Object-Oriented Pipeline

In spaCy, you load a trained language model with spacy.load(), which returns an nlp pipeline object. Passing text through this pipeline runs tokenization, POS tagging, lemmatization, dependency parsing, and named entity recognition in a single call, returning a rich Doc object.

Downloading Models

Before running spaCy scripts, you must download a pre-trained model for your language via the terminal:

python -m spacy download en_core_web_sm (Small English Model ~12MB)
python -m spacy download en_core_web_trf (Transformer-based English Model ~400MB)
Processing Text into a Doc Object
import spacy
import pandas as pd

# Load the small English model pipeline
nlp = spacy.load("en_core_web_sm")

text = "Apple is looking at buying U.K. startup for $1 billion."
# Process the text - this instantly runs Tokenizer, Tagger, Parser, NER, and Lemmatizer!
doc = nlp(text)

# We can easily extract the rich linguistic data into a DataFrame for viewing
token_data = []
for token in doc:
    token_data.append([
        token.text,        # The exact text
        token.lemma_,      # The root form (buying -> buy)
        token.pos_,        # Part of speech (VERB, PROPN)
        token.dep_,        # Syntactic dependency (Subject, Object)
        token.is_stop,     # Is it a stopword? (True/False)
        token.is_alpha     # Is it alphabetical? (True/False)
    ])

df = pd.DataFrame(token_data, columns=['Text', 'Lemma', 'POS', 'Dependency', 'Is Stop', 'Is Alpha'])
print(df.head(6))

Level 2 — Named Entity Recognition (NER)

One of spaCy's strongest out-of-the-box features is its highly accurate named entity recognizer, which identifies people, places, organizations, monetary amounts, dates, and more.

Extracting Entities
text2 = "On July 20, 1969, Neil Armstrong walked on the Moon. NASA spent roughly $25 billion on the Apollo program."
doc2 = nlp(text2)

print(f"{'Entity Text':<20} | {'Label':<10} | {'Explanation'}")
print("-" * 60)
for ent in doc2.ents:
    # ent.label_ gives the code (e.g. 'GPE'), spacy.explain() translates it to human text
    print(f"{ent.text:<20} | {ent.label_:<10} | {spacy.explain(ent.label_)}")

# Output:
# July 20, 1969        | DATE       | Absolute or relative dates or periods
# Neil Armstrong       | PERSON     | People, including fictional
# the Moon             | LOC        | Non-GPE locations, mountain ranges, bodies of water
# NASA                 | ORG        | Companies, agencies, institutions, etc.
# roughly $25 billion  | MONEY      | Monetary values, including unit

Level 3 — Custom Rule-Based Matching

While regular expressions operate on raw strings, spaCy's Matcher operates on Doc objects and token attributes. This lets you write powerful rules based on grammar rather than mere character sequences.

Grammar-Aware Pattern Matching
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

# Let's find patterns where a verb is followed by a specific pronoun
# e.g., "loved it", "hated him", "saw them"
pattern = [
    {"POS": "VERB"},             # Match any verb
    {"POS": "PRON"}              # Match any pronoun immediately after
]

matcher.add("VERB_PRON_PATTERN", [pattern])

doc3 = nlp("I loved it! But my brother absolutely hated him for what he did.")
matches = matcher(doc3)

for match_id, start, end in matches:
    matched_span = doc3[start:end]
    print(f"Found match: '{matched_span.text}'")
    
# Found match: 'loved it'
# Found match: 'hated him'

displaCy visualization: spaCy includes a built-in visualizer, spacy.displacy, that renders syntax dependency trees and highlighted named entities as HTML, directly in your browser or in a Jupyter notebook.