Text Preprocessing
Text cleaning, tokenization, stop words, stemming, lemmatization, and POS tagging.
Text Cleaning
Text Cleaning and Preprocessing
Before any sophisticated NLP algorithm can be applied, raw text must be cleaned. Text cleaning is often the most time-consuming part of an NLP project, as real-world data is inherently messy.
Common Text Cleaning Techniques
1. Lowercasing
Converting all characters to lowercase ensures that "Apple" and "apple" are treated as the same word, reducing the total vocabulary size.
"I LOVE Python!" → "i love python!"
2. Removing HTML or Markup
When scraping data from the web, strip HTML tags with a parser such as BeautifulSoup (or, for simple cases, a regular expression) so that markup is not treated as words.
"<p>Hello World</p>" → "Hello World"
3. Removing Punctuation vs Keeping Punctuation
For simple frequency-based tasks (like spam detection), punctuation mostly adds noise. However, punctuation can change semantics dramatically:
"Let's eat, Grandma!" (Invitation)
"Let's eat Grandma!" (Cannibalism)
4. Expanding Contractions
Expanding standard contractions normalizes the text, so that "they're" and "they are" map to the same tokens.
"They're going." → "They are going."

import re

def clean_text(text):
    # 1. Convert to lowercase
    text = text.lower()
    # 2. Expand common contractions (simplified; only straight apostrophes are handled)
    contractions = {"don't": "do not", "isn't": "is not", "you're": "you are"}
    for word, replacement in contractions.items():
        text = text.replace(word, replacement)
    # 3. Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # 4. Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # 5. Remove punctuation, digits & special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # 6. Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

raw = "<h1>Wow! NLP is so COOL!!! don't click https://example.com 123</h1>"
print(clean_text(raw))
# Output: "wow nlp is so cool do not click"
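The regex in step 3 is a quick approximation: it leaves entities such as `&amp;` undecoded and can misfire on unusual markup. As a hedged alternative, here is a minimal sketch that uses Python's standard-library `html.parser` instead. The names `TextExtractor` and `strip_html` are illustrative and not part of the original function.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of `markup` with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(strip_html("<p>Hello <b>World</b> &amp; friends</p>"))
# Hello World & friends
```

A parser-based approach is slower than a single regex but tends to be more robust on scraped pages. BeautifulSoup offers the same idea with a friendlier interface.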
Tokenization
Tokenization in NLP
Tokenization is the process of breaking down a stream of text into smaller, meaningful units called tokens. Why can't we just use Python's str.split(' ') method? Because plain whitespace splitting fails on punctuation, contractions, and acronyms!
"Mr. O'Neill doesn't go."
Whitespace split: → ["Mr.", "O'Neill", "doesn't", "go."]
NLP Tokenized (one possible result; exact splits vary by tokenizer): → ["Mr.", "O", "'", "Neill", "does", "n't", "go", "."]
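To make the difference concrete, the following sketch compares whitespace splitting with a deliberately naive regex tokenizer from the standard library. The regex is an illustrative assumption, not the rule set of any particular NLP library. It shows that separating punctuation is easy, while deciding which punctuation belongs to a token ("Mr.", "n't") is the hard part that dedicated tokenizers handle.

```python
import re

text = "Mr. O'Neill doesn't go."

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = text.split(" ")

# Naive regex tokenizer: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
# ['Mr.', "O'Neill", "doesn't", 'go.']
print(regex_tokens)
# ['Mr', '.', 'O', "'", 'Neill', 'doesn', "'", 't', 'go', '.']
```

The regex version over-splits: it breaks the abbreviation "Mr." and cuts "doesn't" into three pieces, which is why library tokenizers rely on exception lists and language-specific rules.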
Types of Tokenization
Sentence Tokenization
Splits paragraphs into sentences. Must be smart enough to know that the period in "Dr." or "U.S.A." does not necessarily end a sentence.
Word Tokenization
Splitting sentences into words and independent punctuation marks like commas and periods.
Subword Tokenization (BPE)
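Used in modern Transformer models: GPT models use Byte-Pair Encoding (BPE), while BERT uses the closely related WordPiece algorithm. Avoids "Out Of Vocabulary" errors by splitting rare words into smaller, known pieces.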
"Unfriendly" → ["un", "friend", "ly"]
Implementation Code
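# NLTK
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time download of the Punkt tokenizer data (newer NLTK releases use 'punkt_tab')
# nltk.download('punkt')

text = "Dr. Smith went to the U.S.A. Did he buy apples?"

# 1. Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)
# Expected output (may vary with the NLTK version and Punkt model):
# ['Dr. Smith went to the U.S.A.', 'Did he buy apples?']

# 2. Word Tokenization
words = word_tokenize(sentences[0])
print("\nWords:", words)
# Typical output: ['Dr.', 'Smith', 'went', 'to', 'the', 'U.S.A', '.']
# NLTK usually splits the sentence-final period into its own token.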
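
# spaCy
import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
text = "Dr. Smith went to the U.S.A."

# Process the text
doc = nlp(text)

# Tokenize
tokens = [token.text for token in doc]
print("spaCy Tokens:", tokens)
# 'Dr.' is kept as a single token via the tokenizer's abbreviation exceptions.
# spaCy's tokenization is non-destructive, so the sentence-final period appears
# exactly once: either attached to the abbreviation or as its own '.' token,
# depending on the tokenizer exceptions in the installed version.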
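The NLTK and spaCy snippets cover sentence and word tokenization. For the subword case described under Types of Tokenization, the core training loop of byte-pair encoding can be sketched in plain Python. This is a toy version under simplifying assumptions: characters are the initial symbols, there is no end-of-word marker, and ties go to the first pair seen. The function names and the word-frequency table are made up for illustration, and production tokenizers differ in many details.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word -> frequency table."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair; first seen wins ties
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


# Hypothetical corpus statistics, for illustration only.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=4)
print("Learned merges:", merges)
print("Segmented vocabulary:", list(vocab))
```

Frequent character sequences are merged first, so common words end up as single tokens while rare words remain split into smaller, reusable pieces.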
Stop Words Removal
Stop Words are the most common words in a language that typically do not add significant semantic meaning to a sentence. Words like "the", "is", "in", "and", "a" are classic examples.
Why Remove Stop Words?
- Reduce Dataset Size: Stop words often account for roughly 20-30% of the tokens in English text.
- Improve Performance: With fewer tokens to process, training and inference run faster.
- Focus on Meaningful Words: Algorithms can focus on the words that carry the core semantic meaning (like nouns and verbs).
However, stop words sometimes carry the entire meaning of a phrase. For example, Shakespeare's line "To be, or not to be" consists entirely of stop words; if you remove them, nothing is left. For this reason, modern deep learning models (such as Transformer-based models like BERT) generally do not remove stop words.
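The effect is easy to reproduce with a small, hand-written stop list. The list below is a hypothetical stand-in for a real one such as NLTK's, and the regex tokenizer is a simplification.

```python
import re

# Hypothetical, hand-written stop list for illustration only.
STOP_WORDS = {"the", "a", "is", "in", "and", "to", "be", "or", "not"}


def remove_stop_words(text):
    """Lowercase, tokenize on letters/apostrophes, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(remove_stop_words("The cat is in the garden"))
# ['cat', 'garden']
print(remove_stop_words("To be, or not to be"))
# []
```

The first sentence keeps its content words, while the second is reduced to an empty list, which is the failure mode described above.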
Implementation and Customization
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK data
# nltk.download('stopwords'); nltk.download('punkt')

# Load English stopwords
stop_words = set(stopwords.words('english'))

# Customizing the stop words list
# 1. Removing a word (e.g. keeping 'not' for Sentiment Analysis)
stop_words.discard('not')  # discard() does not raise if the word is absent
# 2. Adding domain-specific words (like 'http' for web scraping)
stop_words.update({'http', 'www'})

text = "The quick brown fox jumps over the lazy dog. it is not fast."

# 1. Tokenize first
word_tokens = word_tokenize(text)

# 2. Filter out the stop words (case-insensitive comparison)
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
print("Filtered Tokens:", filtered_sentence)
# Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.', 'not', 'fast', '.']
# Notice how "The", "over", "the", "it", "is" were removed, but "not" was kept!
Stemming
Stemming Techniques
Stemming is a text normalization technique that reduces words to their base or root form by chopping off prefixes or suffixes according to a fixed set of rules. It is a heuristic process that operates purely on string manipulation.
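To illustrate what "a fixed set of rules" means in practice, here is a toy suffix stripper modelled on the plural-handling rules (step 1a) of the original Porter algorithm. It is a sketch of the idea only, not a full stemmer; the names `STEP_1A_RULES` and `strip_plural_suffix` are illustrative.

```python
# Ordered (suffix, replacement) rules; the first matching suffix wins.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (protects the double s)
    ("s", ""),       # cats     -> cat
]


def strip_plural_suffix(word):
    """Apply the first matching suffix rule; purely string-based."""
    word = word.lower()
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["ponies", "caresses", "caress", "cats", "bus"]:
    print(w, "->", strip_plural_suffix(w))
```

Because the rules never consult a dictionary, they also produce non-words such as "poni" or "bu" (from "bus"). That is the trade-off stemming makes for speed and simplicity.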
Porter Stemmer
One of the oldest (1980) and most widely used suffix stripping algorithms. It uses 5 phases of word reduction.
- "ponies" → "poni"
- "caresses" → "caress"
Snowball Stemmer
Its English stemmer is also known as Porter2, a revision of the original Porter algorithm with more consistent rules. The Snowball framework also provides stemmers for many other languages.
Lancaster Stemmer
The most aggressive stemming algorithm. It is very fast but often chops words down to unreadable levels.
- "maximum" → "maxim"
Over-stemming
When two words with different meanings are stemmed to the same root.
"universal", "university", "universe" → "univers"
Under-stemming
When two words with the same meaning are stemmed to different roots.
"alumnus", "alumni" → "alumnus", "alumni"
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

words = ["running", "generously", "history", "historical", "better", "universities"]

# Print a side-by-side comparison table
print(f"{'Word':<15} | {'Porter':<15} | {'Snowball':<15} | {'Lancaster':<15}")
print("-" * 65)
for w in words:
    p_stem = porter.stem(w)
    s_stem = snowball.stem(w)
    l_stem = lancaster.stem(w)
    print(f"{w:<15} | {p_stem:<15} | {s_stem:<15} | {l_stem:<15}")

# Output observations (Porter and Snowball):
# 'better' stays 'better': stemming cannot map irregular forms to 'good'
# 'universities' -> 'univers'
# 'historical' -> 'histor'
# Lancaster is usually the most aggressive of the three and tends to produce shorter stems.
Lemmatization
Lemmatization, unlike Stemming, reduces words to their valid dictionary base form, known as the lemma. While stemming blindly chops off letters, lemmatization uses vocabulary and morphological analysis of words, referring to a dictionary like WordNet.
Stemming vs Lemmatization
| Original Word | Stemming | Lemmatization |
|---|---|---|
| Studying | studi | study |
| Better | better | good |
| Mice | mice | mouse |
| Was / Were | wa / were | be |
Lemmatization is computationally more expensive but yields much higher quality, readable text.
The Importance of POS Context
To correctly lemmatize an irregular word, the algorithm must know its Part-Of-Speech (POS) context. For example, look at the word "saw":
- "He saw a bird." (Verb) → lemma is see
- "He cut it with a saw." (Noun) → lemma is saw
NLTK's lemmatizer defaults to Noun if you don't provide the POS tag!
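A common way to supply that POS information is to convert tagger output into WordNet's single-letter codes ('n', 'v', 'a', 'r'). The sketch below assumes Penn Treebank tags, such as those returned by `nltk.pos_tag`, and uses a tiny hand-written lemma table in place of WordNet so that it runs without any downloads. `penn_to_wordnet`, `TOY_LEMMAS`, and `toy_lemmatize` are illustrative names.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, and fallback for everything else


# Hypothetical mini-dictionary standing in for WordNet.
TOY_LEMMAS = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("better", "a"): "good",
    ("better", "r"): "well",
}


def toy_lemmatize(word, penn_tag):
    """Look up the lemma for (word, POS); fall back to the lowercased word."""
    pos = penn_to_wordnet(penn_tag)
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())


print(toy_lemmatize("saw", "VBD"))  # see
print(toy_lemmatize("saw", "NN"))   # saw
```

With NLTK, the same mapping function can feed the `pos` argument of `WordNetLemmatizer.lemmatize`, so each token is lemmatized according to its tagged part of speech.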
NLTK vs spaCy Lemmatization
In practice, spaCy is often preferred for lemmatization because its pipeline runs a POS tagger before the lemmatizer, so you do not have to supply POS tags manually.
# NLTK
import nltk
from nltk.stem import WordNetLemmatizer

# One-time download of the WordNet data
# nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Default POS is 'n' (Noun)
print(lemmatizer.lemmatize("better"))
# Output: better (not the intended lemma: the word is treated as a noun)

# Changing POS to 'a' (Adjective)
print(lemmatizer.lemmatize("better", pos="a"))
# Output: good (Correct!)

# Changing POS to 'v' (Verb)
print(lemmatizer.lemmatize("running", pos="v"))
# Output: run (Correct!)
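
# spaCy
import spacy

# Load language model (python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
text = "The mice were running better than before."
doc = nlp(text)

print(f"{'Word':<10} | {'Lemma':<10}")
print("-" * 25)
for token in doc:
    print(f"{token.text:<10} | {token.lemma_:<10}")

# Selected rows of typical output (exact lemmas can vary with the model version):
# mice | mouse
# were | be
# running | run
# better | well
# Here "better" modifies the verb, so it is tagged as an adverb and mapped to "well", not "good".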