NLP Tutorial

Parsing Techniques

CFG, constituency parsing, dependency parsing, and shift-reduce parsing.

Context-Free Grammar

Context-Free Grammars (CFG)

A Context-Free Grammar (CFG) is a mathematical system of strict rewriting rules used to generate syntactically valid sentences in a language. It forms the backbone of Constituency Parsing.

It is called "context-free" because the left-hand side of every rule consists of a single non-terminal symbol, meaning the rule can be applied regardless of the surrounding context.

Anatomy of a CFG

A simple CFG for English contains rules written as A → B C (meaning "A consists of B and C").

Lexicon Rules (Emits Words) Det → "the" | "a"
Noun → "dog" | "cat"
Verb → "chased" | "saw"
Prep → "with" | "in"
Structural Rules (Combines phrases) S → NP VP
NP → Det Noun
VP → Verb NP

Generating a Sentence

Starting from the ROOT state S (Sentence), we apply the substitution rules from top to bottom:

1. S
2. NP VP
3. Det Noun VP
4. "the" "dog" VP
5. "the" "dog" Verb NP
6. "the" "dog" "chased" Det Noun
7. "The dog chased the cat."
The Problem with Hand-Coded CFGs: Linguistic ambiguity is massive. You can write 1,000 rules, and an English sentence will still find a way to be ambiguous (e.g., prepositional attachment: "I shot an elephant in my pajamas"). Therefore, modern NLP uses Probabilistic Context-Free Grammars (PCFGs), adding mathematical probabilities learned from data to every rule to pick the "most likely" valid tree.

Constituency Parsing

Constituency Parsing

Constituency Parsing (or Phrase-Structure Parsing) automatically organizes a sentence into nested structural groups using the rules defined by a Probabilistic Context-Free Grammar (PCFG). The output is a highly structured, tree-like hierarchy of phrases.

Visualizing the Tree Structure

In a constituency tree, root/interior nodes represent non-terminals (Phrases like NP, VP), and the leaf nodes are the actual text words of the sentence.

"The clever dog chased the ball"
S
  • └─ NP
    • └─ Det (The)
    • └─ Adj (clever)
    • └─ Noun (dog)
  • └─ VP
    • └─ Verb (chased)
    • └─ NP
      • └─ Det (the)
      • └─ Noun (ball)
Core Phrase Types
  • NP (Noun Phrase): The subject or object of a sentence. Contains the noun and its modifiers. e.g., "The big red ball"
  • VP (Verb Phrase): The predicate. It contains the verb and its dependents (like direct objects). e.g., "chased the ball across the yard"
  • PP (Prepositional Phrase): Contains a preposition and acts generally as an adverb or adjective. e.g., "into the box"
Common NLP Use Cases
  • Coreference Resolution: Finding noun phrases (NPs) that refer to the same entity.
  • Chomskyan Linguistics: Direct implementation of Chomsky's formal hierarchies.
  • Information Extraction: Extracting distinct semantic chunks from dense legal documents.
Visualizing Trees using NLTK CoreNLP
import nltk

# Assume we already have a parsed standard string
tree_string = "(S (NP (Det The) (Adj clever) (Noun dog)) (VP (Verb chased) (NP (Det the) (Noun ball))))"

# Parse the string into an NLTK tree object
tree = nltk.Tree.fromstring(tree_string)

# Print the readable ASCII hierarchy
print(tree)

# You can also draw actual graphical UI popups using:
# tree.draw()

Dependency Parsing

Dependency Parsing

Unlike Constituency Parsing (which builds nested phrase boxes), Dependency Parsing directly draws directional linkages between the words themselves. It maps out which word is the grammatical "Head" and which word is its "Dependent".

Dependency parsing is much more popular than constituency parsing in modern NLP (especially through spaCy) because it is faster, lighter, and easier to use for Information Extraction.

Head vs Dependent

Dependency links are directional (). In the phrase "clever dog":

  • "dog" is the Head word (ROOT concept).
  • "clever" is the Dependent word (it modifies the dog).
  • Link type: amod (Adjectival Modifier).
Visualizing Dependencies
The
det
clever
amod
dog
nsubj
chased ROOT
the
det
ball
dobj

Universal Dependency Tags

Tag Meaning Example Relation
nsubjNominal SubjectProvides the subject to the verb. (dog → chased)
dobjDirect ObjectThe entity acted upon. (chased → ball)
amodAdjectival ModifierModifies the noun. (clever → dog)
prepPrepositional ModifierLinks preposition to its head. (sat → on)
ROOTRoot of SentenceThe central active verb. (chased)
Extracting Dependencies with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The clever dog chased the ball.")

# Print Subject-Verb-Object relationships!
print(f"{'Text':<10} | {'Dependent Tag':<15} | {'Head Word':<10}")
print("-" * 40)
for token in doc:
    print(f"{token.text:<10} | {token.dep_:<15} | {token.head.text:<10}")

'''
Output:
Text       | Dependent Tag   | Head Word 
----------------------------------------
The        | det             | dog       
clever     | amod            | dog       
dog        | nsubj           | chased     <-- We found the subject!
chased     | ROOT            | chased     <-- Core verb of sentence
the        | det             | ball      
ball       | dobj            | chased     <-- We found the object!
.          | punct           | chased    
'''

Shift-Reduce Parsing

Transition-Based Parsing (Shift-Reduce)

How does an algorithm actually construct a Dependency tree in code? The most famous algorithm is the Shift-Reduce Parser. It reads words one by one and uses a stack data structure to draw grammatical arrows incrementally as it processes the sentence left-to-right.

This algorithm runs in highly efficient O(N) time complexity, the primary reason spaCy uses it for its lightning-fast dependency parser.

The Three Core Data Structures

  1. The Buffer: A queue containing all the words waiting to be read.
  2. The Stack: LIFO structure holding words currently being processed.
  3. The Output Memory: A list of the dependency linkages `(Head → Dependent)` mapped out so far.

The 3 Possible Actions

1. SHIFT

Move the first word from the Buffer onto the top of the Stack.

2. LEFT-ARC

Draw an arrow from the 2nd token on the stack to the 1st token on stack. Then Pop the 1st token.

3. RIGHT-ARC

Draw an arrow from the 1st token on the stack to the 2nd token on stack. Then Pop the 2nd token.

Example Walkthrough: "I saw John"
Step Action Taken Stack (Top is right) Buffer (Next is left) Link Created
0Initialize[ROOT]["I", "saw", "John"]None
1SHIFT[ROOT, "I"]["saw", "John"]None
2SHIFT[ROOT, "I", "saw"]["John"]None
3LEFT-ARC[ROOT, "saw"]["John"]saw → I (nsubj)
4SHIFT[ROOT, "saw", "John"][]None
5RIGHT-ARC[ROOT, "saw"][]saw → John (dobj)
6RIGHT-ARC[ROOT][]ROOT → saw (root)

Note: Originally, parsers used heuristic rules to decide which action to take. Modern parsers train a Neural Network classification model to look at the stack at any given frame and mathematically classify whether it should output Shift, Left, or Right!