Tokenization
Master word, sentence, and subword tokenization with practical examples using NLTK and spaCy.
Tokenization in NLP
Tokenization is the process of breaking a stream of text into smaller, meaningful units called tokens. Why can't we just use Python's .split(' ')? Because naive splitting fails on punctuation, contractions, and abbreviations.
Simple split:
"Mr. O'Neill doesn't go." → ["Mr.", "O'Neill", "doesn't", "go."]
NLP tokenized:
"Mr. O'Neill doesn't go." → ["Mr.", "O", "'", "Neill", "does", "n't", "go", "."]
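The failure of the simple split is easy to reproduce:

```python
# Plain .split(' ') only cuts on spaces, so punctuation stays
# glued to words ("go.") and contractions are never opened up.
text = "Mr. O'Neill doesn't go."
print(text.split(' '))
# → ['Mr.', "O'Neill", "doesn't", 'go.']
```

An NLP tokenizer, by contrast, treats the period and the "n't" contraction as tokens in their own right.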
Types of Tokenization
Sentence Tokenization
Splits paragraphs into sentences. Must be smart enough to know that "Dr." or "U.S.A." doesn't end a sentence.
Word Tokenization
Splits sentences into words and standalone punctuation marks such as commas and periods.
Subword Tokenization (BPE)
Used by modern LLMs (e.g., BERT and GPT). Resolves "out of vocabulary" errors by splitting rare words into known subword pieces.
"Unfriendly" → ["un", "friend", "ly"]
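The idea behind subword splitting can be sketched with a toy greedy longest-match tokenizer. Note this is a simplified illustration, not real BPE: actual BPE learns merge rules from corpus statistics, and the tiny vocabulary here is an assumption chosen to reproduce the example above.

```python
# Toy subword tokenizer: greedily match the longest piece found in a
# (hypothetical) learned vocabulary, falling back to single characters.
def subword_tokenize(word, vocab):
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate piece first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown-character fallback
            i += 1
    return tokens

vocab = {"un", "friend", "ly"}  # assumed subword vocabulary
print(subword_tokenize("unfriendly", vocab))
# → ['un', 'friend', 'ly']
```

Because every character is a valid fallback token, a subword tokenizer can represent any input string, which is how it avoids out-of-vocabulary errors.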
Implementation Code
1. Tokenization using NLTK
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')  # Punkt tokenizer models (required once)
text = "Dr. Smith went to the U.S.A. Did he buy apples?"
# 1. Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)
# Output: ['Dr. Smith went to the U.S.A.',
# 'Did he buy apples?']
# 2. Word Tokenization
words = word_tokenize(sentences[0])
print("\nWords:", words)
# Output: ['Dr.', 'Smith', 'went', 'to',
# 'the', 'U.S.A.']
2. Tokenization using spaCy
import spacy
# Requires the small English model:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
text = "Dr. Smith went to the U.S.A."
# Process the text
doc = nlp(text)
# Tokenize
tokens = [token.text for token in doc]
print("spaCy Tokens:", tokens)
# Output: ['Dr.', 'Smith', 'went', 'to',
# 'the', 'U.S.A.', '.']
# Notice spaCy separates the final period!
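For a dependency-free comparison, a regular expression can approximate this behavior. The pattern below is an illustration written for this example (an assumption, not a rule taken from either library): it keeps dotted abbreviations like "Dr." and "U.S.A." together while splitting off other punctuation.

```python
import re

# Alternation order matters: try dotted abbreviations first,
# then plain words, then any single punctuation character.
pattern = r"[A-Za-z]+(?:\.[A-Za-z]+)*\.|\w+|[^\w\s]"

text = "Dr. Smith went to the U.S.A."
print(re.findall(pattern, text))
# → ['Dr.', 'Smith', 'went', 'to', 'the', 'U.S.A.']
```

A regex like this cannot tell a sentence-final period from an abbreviation (it would also keep the "." in "go." attached), which is exactly why NLTK and spaCy ship trained, rule-rich tokenizers instead.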