Parsing Techniques | Nikhil Learn Hub

Context-Free Grammar

Context-Free Grammars (CFG)

A Context-Free Grammar (CFG) is a mathematical system of strict rewriting rules used to generate syntactically valid sentences in a language. It forms the backbone of Constituency Parsing.

It is called "context-free" because the left-hand side of every rule consists of a single non-terminal symbol, meaning the rule can be applied regardless of the surrounding context.

Anatomy of a CFG

A simple CFG for English contains rules written as A → B C (meaning "A consists of B and C").

Lexicon Rules (Emits Words) Det → "the" | "a"
Noun → "dog" | "cat"
Verb → "chased" | "saw"
Prep → "with" | "in"

Structural Rules (Combines phrases) S → NP VP
NP → Det Noun
VP → Verb NP

Generating a Sentence

Starting from the ROOT state S (Sentence), we apply the substitution rules from top to bottom:

1. S

2. NP VP

3. Det Noun VP

4. "the" "dog" VP

5. "the" "dog" Verb NP

6. "the" "dog" "chased" Det Noun

7. "The dog chased the cat."

The Problem with Hand-Coded CFGs: Linguistic ambiguity is massive. You can write 1,000 rules, and an English sentence will still find a way to be ambiguous (e.g., prepositional attachment: "I shot an elephant in my pajamas"). Therefore, modern NLP uses Probabilistic Context-Free Grammars (PCFGs), adding mathematical probabilities learned from data to every rule to pick the "most likely" valid tree.

Constituency Parsing

Constituency Parsing (or Phrase-Structure Parsing) automatically organizes a sentence into nested structural groups using the rules defined by a Probabilistic Context-Free Grammar (PCFG). The output is a highly structured, tree-like hierarchy of phrases.

Visualizing the Tree Structure

In a constituency tree, root/interior nodes represent non-terminals (Phrases like NP, VP), and the leaf nodes are the actual text words of the sentence.

"The clever dog chased the ball"

S

└─ NP
- └─ Det (The)
- └─ Adj (clever)
- └─ Noun (dog)
└─ VP
- └─ Verb (chased)
- └─ NP
  - └─ Det (the)
  - └─ Noun (ball)

Core Phrase Types

NP (Noun Phrase): The subject or object of a sentence. Contains the noun and its modifiers. e.g., "The big red ball"
VP (Verb Phrase): The predicate. It contains the verb and its dependents (like direct objects). e.g., "chased the ball across the yard"
PP (Prepositional Phrase): Contains a preposition and acts generally as an adverb or adjective. e.g., "into the box"

Common NLP Use Cases

Coreference Resolution: Finding noun phrases (NPs) that refer to the same entity.
Chomskyan Linguistics: Direct implementation of Chomsky's formal hierarchies.
Information Extraction: Extracting distinct semantic chunks from dense legal documents.

Visualizing Trees using NLTK CoreNLP

import nltk

# Assume we already have a parsed standard string
tree_string = "(S (NP (Det The) (Adj clever) (Noun dog)) (VP (Verb chased) (NP (Det the) (Noun ball))))"

# Parse the string into an NLTK tree object
tree = nltk.Tree.fromstring(tree_string)

# Print the readable ASCII hierarchy
print(tree)

# You can also draw actual graphical UI popups using:
# tree.draw()

Dependency Parsing

Unlike Constituency Parsing (which builds nested phrase boxes), Dependency Parsing directly draws directional linkages between the words themselves. It maps out which word is the grammatical "Head" and which word is its "Dependent".

Dependency parsing is much more popular than constituency parsing in modern NLP (especially through spaCy) because it is faster, lighter, and easier to use for Information Extraction.

Head vs Dependent

Dependency links are directional (). In the phrase "clever dog":

"dog" is the Head word (ROOT concept).
"clever" is the Dependent word (it modifies the dog).
Link type: amod (Adjectival Modifier).

Visualizing Dependencies

The
det

clever
amod

dog
nsubj

chased ROOT

the
det

ball
dobj

Universal Dependency Tags

Tag	Meaning	Example Relation
nsubj	Nominal Subject	Provides the subject to the verb. (dog → chased)
dobj	Direct Object	The entity acted upon. (chased → ball)
amod	Adjectival Modifier	Modifies the noun. (clever → dog)
prep	Prepositional Modifier	Links preposition to its head. (sat → on)
ROOT	Root of Sentence	The central active verb. (chased)

Extracting Dependencies with spaCy

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The clever dog chased the ball.")

# Print Subject-Verb-Object relationships!
print(f"{'Text':<10} | {'Dependent Tag':<15} | {'Head Word':<10}")
print("-" * 40)
for token in doc:
    print(f"{token.text:<10} | {token.dep_:<15} | {token.head.text:<10}")

'''
Output:
Text       | Dependent Tag   | Head Word 
----------------------------------------
The        | det             | dog       
clever     | amod            | dog       
dog        | nsubj           | chased     <-- We found the subject!
chased     | ROOT            | chased     <-- Core verb of sentence
the        | det             | ball      
ball       | dobj            | chased     <-- We found the object!
.          | punct           | chased    
'''

Shift-Reduce Parsing

Transition-Based Parsing (Shift-Reduce)

How does an algorithm actually construct a Dependency tree in code? The most famous algorithm is the Shift-Reduce Parser. It reads words one by one and uses a stack data structure to draw grammatical arrows incrementally as it processes the sentence left-to-right.

This algorithm runs in highly efficient O(N) time complexity, the primary reason spaCy uses it for its lightning-fast dependency parser.

                    The Three Core Data Structures
                    The Buffer: A queue containing all the words waiting to be read.
The Stack: LIFO structure holding words currently being processed.
The Output Memory: A list of the dependency linkages `(Head → Dependent)` mapped out so far.

                

The 3 Possible Actions

1. SHIFT

Move the first word from the Buffer onto the top of the Stack.

2. LEFT-ARC

Draw an arrow from the 2nd token on the stack to the 1st token on stack. Then Pop the 1st token.

3. RIGHT-ARC

Draw an arrow from the 1st token on the stack to the 2nd token on stack. Then Pop the 2nd token.

Example Walkthrough: "I saw John"

Step	Action Taken	Stack (Top is right)	Buffer (Next is left)	Link Created
0	Initialize	[ROOT]	["I", "saw", "John"]	None
1	SHIFT	[ROOT, "I"]	["saw", "John"]	None
2	SHIFT	[ROOT, "I", "saw"]	["John"]	None
3	LEFT-ARC	[ROOT, "saw"]	["John"]	saw → I (nsubj)
4	SHIFT	[ROOT, "saw", "John"]	[]	None
5	RIGHT-ARC	[ROOT, "saw"]	[]	saw → John (dobj)
6	RIGHT-ARC	[ROOT]	[]	ROOT → saw (root)

Note: Originally, parsers used heuristic rules to decide which action to take. Modern parsers train a Neural Network classification model to look at the stack at any given frame and mathematically classify whether it should output Shift, Left, or Right!