Tokenization
Master word, sentence, and subword tokenization with practical examples using NLTK and spaCy.
Tokenization in NLP
Tokenization is the process of breaking a stream of text into smaller, meaningful units called tokens. Why can't we just use Python's .split(' ')? Because naive splitting fails on punctuation, contractions, and abbreviations.
Simple split:
"Mr. O'Neill doesn't go." → ["Mr.", "O'Neill", "doesn't", "go."]
NLP tokenized:
"Mr. O'Neill doesn't go." → ["Mr.", "O", "'", "Neill", "does", "n't", "go", "."]
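The failure of the simple split is easy to reproduce:

```python
# Plain .split(' ') only cuts on spaces, so punctuation stays
# glued to words ("go.") and contractions are never opened up.
text = "Mr. O'Neill doesn't go."
print(text.split(' '))
# → ['Mr.', "O'Neill", "doesn't", 'go.']
```

An NLP tokenizer, by contrast, treats the period and the "n't" contraction as tokens in their own right.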
Types of Tokenization
Sentence Tokenization
Splits paragraphs into sentences. Must be smart enough to know that "Dr." or "U.S.A." doesn't end a sentence.
Word Tokenization
Splits sentences into words and standalone punctuation marks such as commas and periods.
Subword Tokenization (BPE)
Used by modern LLMs (e.g., BERT and GPT). Resolves "out of vocabulary" errors by splitting rare words into known subword pieces.
"Unfriendly" → ["un", "friend", "ly"]
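The idea behind subword splitting can be sketched with a toy greedy longest-match tokenizer. Note this is a simplified illustration, not real BPE: actual BPE learns merge rules from corpus statistics, and the tiny vocabulary here is an assumption chosen to reproduce the example above.

```python
# Toy subword tokenizer: greedily match the longest piece found in a
# (hypothetical) learned vocabulary, falling back to single characters.
def subword_tokenize(word, vocab):
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate piece first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown-character fallback
            i += 1
    return tokens

vocab = {"un", "friend", "ly"}  # assumed subword vocabulary
print(subword_tokenize("unfriendly", vocab))
# → ['un', 'friend', 'ly']
```

Because every character is a valid fallback token, a subword tokenizer can represent any input string, which is how it avoids out-of-vocabulary errors.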
Implementation Code
1. Tokenization using NLTK
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')  # Punkt tokenizer models (required once)
text = "Dr. Smith went to the U.S.A. Did he buy apples?"
# 1. Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)
# Output: ['Dr. Smith went to the U.S.A.',
# 'Did he buy apples?']
# 2. Word Tokenization
words = word_tokenize(sentences[0])
print("\nWords:", words)
# Output: ['Dr.', 'Smith', 'went', 'to',
# 'the', 'U.S.A.']
2. Tokenization using spaCy
import spacy
# Requires the small English model:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
text = "Dr. Smith went to the U.S.A."
# Process the text
doc = nlp(text)
# Tokenize
tokens = [token.text for token in doc]
print("spaCy Tokens:", tokens)
# Output: ['Dr.', 'Smith', 'went', 'to',
# 'the', 'U.S.A.', '.']
# Notice spaCy separates the final period!
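For a dependency-free comparison, a regular expression can approximate this behavior. The pattern below is an illustration written for this example (an assumption, not a rule taken from either library): it keeps dotted abbreviations like "Dr." and "U.S.A." together while splitting off other punctuation.

```python
import re

# Alternation order matters: try dotted abbreviations first,
# then plain words, then any single punctuation character.
pattern = r"[A-Za-z]+(?:\.[A-Za-z]+)*\.|\w+|[^\w\s]"

text = "Dr. Smith went to the U.S.A."
print(re.findall(pattern, text))
# → ['Dr.', 'Smith', 'went', 'to', 'the', 'U.S.A.']
```

A regex like this cannot tell a sentence-final period from an abbreviation (it would also keep the "." in "go." attached), which is exactly why NLTK and spaCy ship trained, rule-rich tokenizers instead.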