Parsing Techniques
CFG, constituency parsing, dependency parsing, and shift-reduce parsing.
Context-Free Grammar
Context-Free Grammars (CFG)
A Context-Free Grammar (CFG) is a mathematical system of strict rewriting rules used to generate syntactically valid sentences in a language. It forms the backbone of Constituency Parsing.
It is called "context-free" because the left-hand side of every rule consists of a single non-terminal symbol, meaning the rule can be applied regardless of the surrounding context.
Anatomy of a CFG
A simple CFG for English contains rules written as A → B C (meaning "A consists of B and C").
Noun → "dog" | "cat"
Verb → "chased" | "saw"
Prep → "with" | "in"
NP → Det Noun
VP → Verb NP
Generating a Sentence
Starting from the ROOT state S (Sentence), we apply the substitution rules from top to bottom:
Constituency Parsing
Constituency Parsing
Constituency Parsing (or Phrase-Structure Parsing) automatically organizes a sentence into nested structural groups using the rules defined by a Probabilistic Context-Free Grammar (PCFG). The output is a highly structured, tree-like hierarchy of phrases.
Visualizing the Tree Structure
In a constituency tree, root/interior nodes represent non-terminals (Phrases like NP, VP), and the leaf nodes are the actual text words of the sentence.
"The clever dog chased the ball"
- └─ NP
- └─ Det (The)
- └─ Adj (clever)
- └─ Noun (dog)
- └─ VP
- └─ Verb (chased)
- └─ NP
- └─ Det (the)
- └─ Noun (ball)
Core Phrase Types
- NP (Noun Phrase): The subject or object of a sentence. Contains the noun and its modifiers. e.g., "The big red ball"
- VP (Verb Phrase): The predicate. It contains the verb and its dependents (like direct objects). e.g., "chased the ball across the yard"
- PP (Prepositional Phrase): Contains a preposition and acts generally as an adverb or adjective. e.g., "into the box"
Common NLP Use Cases
- Coreference Resolution: Finding noun phrases (NPs) that refer to the same entity.
- Chomskyan Linguistics: Direct implementation of Chomsky's formal hierarchies.
- Information Extraction: Extracting distinct semantic chunks from dense legal documents.
import nltk
# Assume we already have a parsed standard string
tree_string = "(S (NP (Det The) (Adj clever) (Noun dog)) (VP (Verb chased) (NP (Det the) (Noun ball))))"
# Parse the string into an NLTK tree object
tree = nltk.Tree.fromstring(tree_string)
# Print the readable ASCII hierarchy
print(tree)
# You can also draw actual graphical UI popups using:
# tree.draw()
Dependency Parsing
Dependency Parsing
Unlike Constituency Parsing (which builds nested phrase boxes), Dependency Parsing directly draws directional linkages between the words themselves. It maps out which word is the grammatical "Head" and which word is its "Dependent".
Dependency parsing is much more popular than constituency parsing in modern NLP (especially through spaCy) because it is faster, lighter, and easier to use for Information Extraction.
Head vs Dependent
Dependency links are directional (). In the phrase "clever dog":
- "dog" is the Head word (ROOT concept).
- "clever" is the Dependent word (it modifies the dog).
- Link type:
amod(Adjectival Modifier).
Visualizing Dependencies
det
amod
nsubj
det
dobj
Universal Dependency Tags
| Tag | Meaning | Example Relation |
|---|---|---|
| nsubj | Nominal Subject | Provides the subject to the verb. (dog → chased) |
| dobj | Direct Object | The entity acted upon. (chased → ball) |
| amod | Adjectival Modifier | Modifies the noun. (clever → dog) |
| prep | Prepositional Modifier | Links preposition to its head. (sat → on) |
| ROOT | Root of Sentence | The central active verb. (chased) |
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The clever dog chased the ball.")
# Print Subject-Verb-Object relationships!
print(f"{'Text':<10} | {'Dependent Tag':<15} | {'Head Word':<10}")
print("-" * 40)
for token in doc:
print(f"{token.text:<10} | {token.dep_:<15} | {token.head.text:<10}")
'''
Output:
Text | Dependent Tag | Head Word
----------------------------------------
The | det | dog
clever | amod | dog
dog | nsubj | chased <-- We found the subject!
chased | ROOT | chased <-- Core verb of sentence
the | det | ball
ball | dobj | chased <-- We found the object!
. | punct | chased
'''
Shift-Reduce Parsing
Transition-Based Parsing (Shift-Reduce)
How does an algorithm actually construct a Dependency tree in code? The most famous algorithm is the Shift-Reduce Parser. It reads words one by one and uses a stack data structure to draw grammatical arrows incrementally as it processes the sentence left-to-right.
This algorithm runs in highly efficient O(N) time complexity, the primary reason spaCy uses it for its lightning-fast dependency parser.
The Three Core Data Structures
- The Buffer: A queue containing all the words waiting to be read.
- The Stack: LIFO structure holding words currently being processed.
- The Output Memory: A list of the dependency linkages `(Head → Dependent)` mapped out so far.
The 3 Possible Actions
1. SHIFT
Move the first word from the Buffer onto the top of the Stack.
2. LEFT-ARC
Draw an arrow from the 2nd token on the stack to the 1st token on stack. Then Pop the 1st token.
3. RIGHT-ARC
Draw an arrow from the 1st token on the stack to the 2nd token on stack. Then Pop the 2nd token.
Example Walkthrough: "I saw John"
| Step | Action Taken | Stack (Top is right) | Buffer (Next is left) | Link Created |
|---|---|---|---|---|
| 0 | Initialize | [ROOT] | ["I", "saw", "John"] | None |
| 1 | SHIFT | [ROOT, "I"] | ["saw", "John"] | None |
| 2 | SHIFT | [ROOT, "I", "saw"] | ["John"] | None |
| 3 | LEFT-ARC | [ROOT, "saw"] | ["John"] | saw → I (nsubj) |
| 4 | SHIFT | [ROOT, "saw", "John"] | [] | None |
| 5 | RIGHT-ARC | [ROOT, "saw"] | [] | saw → John (dobj) |
| 6 | RIGHT-ARC | [ROOT] | [] | ROOT → saw (root) |
Note: Originally, parsers used heuristic rules to decide which action to take. Modern parsers train a Neural Network classification model to look at the stack at any given frame and mathematically classify whether it should output Shift, Left, or Right!