Text Preprocessing

Text Cleaning

Text Cleaning and Preprocessing

Before any sophisticated NLP algorithm can be applied, raw text must be cleaned. Text cleaning is often the most time-consuming part of an NLP project, as real-world data is inherently messy.

Rule of Thumb: Garbage In, Garbage Out. The quality of your text cleaning pipeline directly dictates the upper limit of your model's final performance!

Common Text Cleaning Techniques

1. Lowercasing

Converting all characters to lowercase ensures that "Apple" and "apple" are treated as the same word, reducing the total vocabulary size.
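
The effect on vocabulary size can be checked by counting distinct tokens before and after lowercasing. The snippet below is a minimal sketch: the sample sentence and the `vocabulary` helper are made up for illustration, and tokens are split on whitespace only.

```python
def vocabulary(text, lowercase=False):
    """Return the set of distinct whitespace-separated tokens in text."""
    if lowercase:
        text = text.lower()
    return set(text.split())

# Illustrative sample text (not from any real dataset)
sample = "Apple pie and apple juice. I LOVE Python and I love python"

raw_vocab = vocabulary(sample)
lower_vocab = vocabulary(sample, lowercase=True)

print(len(raw_vocab), sorted(raw_vocab))
print(len(lower_vocab), sorted(lower_vocab))
```

The trade-off is that case sometimes carries information: after lowercasing, "Apple" the company and "apple" the fruit can no longer be told apart.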

"I LOVE Python!" → "i love python!"

2. Removing HTML or Markup

When scraping data from the web, strip HTML tags with a parser such as BeautifulSoup or, for simple cases, a regular expression, so that markup is not treated as words.
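
BeautifulSoup's `get_text()` is the usual tool for this. To stay dependency-free, the sketch below uses Python's built-in `html.parser` instead; the `TagStripper` class and `strip_html` helper are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(html):
    """Return the text content of html with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

print(strip_html("<p>Hello World</p>"))
print(strip_html('<a href="https://example.com">Tom &amp; Jerry</a>'))
```

Unlike a plain regular expression, a parser also decodes entities and copes with attributes that contain ">" characters.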

"<p>Hello World</p>" → "Hello World"

3. Removing Punctuation vs Keeping Punctuation

For simple word-frequency tasks (like spam detection), punctuation often just adds noise. However, punctuation can change semantics dramatically:

"Let's eat, Grandma!" (Dinner time)
"Let's eat Grandma!" (Cannibalism)

4. Expanding Contractions

Expanding standard contractions normalizes the text, so that a contraction and its spelled-out form produce the same tokens.
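
One way to do this is a lookup table combined with a single regular expression. The sketch below is an assumption-laden example: the `CONTRACTIONS` mapping is deliberately tiny and hypothetical, and ambiguous forms such as "he's" (he is / he has) are not handled at all.

```python
import re

# Hypothetical, deliberately small mapping; real projects need a longer list.
CONTRACTIONS = {
    "they're": "they are",
    "don't": "do not",
    "isn't": "is not",
    "you're": "you are",
}

# One alternation with word boundaries, so only whole words are matched.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    def replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Preserve a leading capital letter ("They're" -> "They are").
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    # Normalize the typographic apostrophe (U+2019) before matching.
    return _PATTERN.sub(replace, text.replace("\u2019", "'"))

print(expand_contractions("They're going."))
print(expand_contractions("You\u2019re right, it isn't far."))
```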

"They're going." → "They are going."

Python Implementation with Regex

import re

def clean_text(text):
    # 1. Convert to lowercase
    text = text.lower()
    
    # 2. Expand common contractions (simplified)
    contractions = {"don't": "do not", "isn't": "is not", "you're": "you are"}
    for word, replacement in contractions.items():
        text = text.replace(word, replacement)
    
    # 3. Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # 4. Remove URLs ("https?" covers both http and https)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    # 5. Remove everything except letters and whitespace
    #    (punctuation, digits, and other special characters)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # 6. Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

raw = "<h1>Wow! NLP is so COOL!!! don't click https://example.com 123</h1>"
print(clean_text(raw))
# Output: "wow nlp is so cool do not click"

Tokenization

Tokenization in NLP

Tokenization is the process of breaking down a stream of text into smaller, meaningful units called tokens. Why can't we just use the Python .split(' ') function? Because standard splitting fails on punctuation, contractions, and acronyms!

Simple split: "Mr. O'Neill doesn't go." → ["Mr.", "O'Neill", "doesn't", "go."]
NLP tokenized (e.g. Treebank-style, as in NLTK): → ["Mr.", "O'Neill", "does", "n't", "go", "."]
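
The gap between whitespace splitting and rule-based tokenization is easy to reproduce. The regular expression below is only one possible, illustrative rule set (`TOKEN_RE` is a name invented here); real tokenizers differ in how they treat apostrophes and abbreviations.

```python
import re

text = "Mr. O'Neill doesn't go."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
print(text.split(" "))

# Illustrative rule: a token is either a run of word characters
# (optionally with internal apostrophes) or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
print(TOKEN_RE.findall(text))
```

This naive pattern separates the final period correctly, but it also splits "Mr." into "Mr" and ".", which is why production tokenizers carry lists of abbreviations and other exceptions.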

Types of Tokenization

Sentence Tokenization

Splits paragraphs into sentences. The tokenizer must be smart enough to know that the periods in "Dr." or "U.S.A." do not necessarily end a sentence.
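
To see why abbreviations matter, compare a naive splitter with one that consults an abbreviation list. This is a simplified sketch, not how any particular library works: the `ABBREVIATIONS` set and both helper functions are hypothetical.

```python
import re

# Hypothetical abbreviation list; real tokenizers learn or ship far larger ones.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "u.s.a."}

def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by whitespace."""
    return re.split(r"(?<=[.?!])\s+", text)

def abbreviation_aware_split(text):
    """Like naive_split, but re-join pieces that end in a known abbreviation."""
    sentences, buffer = [], ""
    for piece in naive_split(text.strip()):
        if not piece:
            continue
        buffer = f"{buffer} {piece}".strip()
        # Only close the sentence if its last word is not an abbreviation.
        if buffer.split()[-1].lower() not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

text = "Dr. Smith lives in the U.S.A. with Mr. Jones. Did he buy apples?"
print(naive_split(text))
print(abbreviation_aware_split(text))
```

The list-based approach has its own blind spot: when an abbreviation really does end a sentence, it fails to split. Trained sentence tokenizers use additional cues, such as the word that follows, to decide.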

Word Tokenization

Splitting sentences into words and independent punctuation marks like commas and periods.

Subword Tokenization (BPE)

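Used in modern language models: GPT models use byte-pair encoding (BPE), while BERT uses the closely related WordPiece algorithm. Subword tokenization avoids "Out Of Vocabulary" errors by splitting rare words into smaller, known pieces.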
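
The core of BPE training is a loop that repeatedly merges the most frequent pair of adjacent symbols. The following is a toy sketch of that loop under simplifying assumptions: the corpus and its frequencies are invented, words start as single characters, and ties are broken by first occurrence. Production tokenizers add byte-level handling, end-of-word markers, and much larger corpora.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of pair with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word -> frequency (made-up numbers for illustration).
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(4):  # learn four merge rules
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)
print(list(vocab))
```

After four merges the frequent word "low" has become a single symbol, while the rarer "lower" is represented as "low" plus leftover characters. This is how an unseen word can still be encoded from known pieces.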

"Unfriendly" → ["un", "friend", "ly"]

Implementation Code

1. Tokenization using NLTK

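import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time download of the Punkt tokenizer models
# (newer NLTK releases name this resource 'punkt_tab')
nltk.download('punkt')

text = "Dr. Smith went to the U.S.A. Did he buy apples?"

# 1. Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)
# Expected output (exact results can vary with the NLTK version):
# ['Dr. Smith went to the U.S.A.',
#  'Did he buy apples?']

# 2. Word Tokenization
words = word_tokenize(sentences[0])
print("\nWords:", words)
# Expected output: ['Dr.', 'Smith', 'went', 'to', 'the', ...]
# Depending on the NLTK version, the last item is either 'U.S.A.'
# or is split into 'U.S.A' and '.'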

2. Tokenization using spaCy

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
text = "Dr. Smith went to the U.S.A."

# Process the text
doc = nlp(text)

# Tokenize
tokens = [token.text for token in doc]
print("spaCy Tokens:", tokens)

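# Typical output: ['Dr.', 'Smith', 'went', 'to',
#                  'the', 'U.S.A.']
# spaCy's tokenization is non-destructive: joining the tokens (with their
# whitespace) always reproduces the input, so no extra period can appear.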

Stop Words Removal

Stop Words are the most common words in a language that typically do not add significant semantic meaning to a sentence. Words like "the", "is", "in", "and", "a" are classic examples.

Common English Stop Words

i me my myself we our you he she it the and but if or because as of at by

Why Remove Stop Words?

Reduce Dataset Size: Stop words often make up a large share of the tokens in a text (figures of 20-30% are commonly cited).
Improve Performance: With fewer tokens to process, models train faster.
Focus on Meaningful Words: Algorithms can focus on the words that actually carry the core semantic meaning (like nouns and verbs).

                        

The Danger of Stop Words Removal:
Sometimes stop words carry the ENTIRE meaning of a phrase. For example, Shakespeare's line "To be, or not to be" consists entirely of stop words; remove them and nothing is left. For this reason, modern deep learning models (such as Transformer-based models like BERT) generally keep stop words.
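
Both effects, the size reduction and the loss of meaning, can be shown with a few lines of plain Python. This is a minimal sketch: the `STOP_WORDS` set is a small hand-picked sample rather than a library list, and the helper names are hypothetical.

```python
import re

# Small hand-picked stop word set for illustration; library lists are far longer.
STOP_WORDS = {"a", "an", "and", "be", "in", "is", "it", "not", "of", "or", "the", "to"}

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stop_word_share(tokens):
    """Fraction of tokens that are stop words (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(t in STOP_WORDS for t in tokens) / len(tokens)

hamlet = tokenize("To be, or not to be")
review = tokenize("The plot is not good")

print(remove_stop_words(hamlet), stop_word_share(hamlet))
print(remove_stop_words(review), stop_word_share(review))
```

The second example is the more dangerous one in practice: once "not" is filtered out, a negative review is left with the tokens "plot" and "good".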

Implementation and Customization

Removing and Customizing Stop Words in Python

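import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads: the stop word lists and the Punkt tokenizer models
# (newer NLTK releases name the tokenizer resource 'punkt_tab')
nltk.download('stopwords')
nltk.download('punkt')

# Load English stop words
stop_words = set(stopwords.words('english'))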

# Customizing the stop words list
# 1. Removing a word (e.g. keeping 'not' for Sentiment Analysis)
stop_words.discard('not')  # discard() does not raise if the word is absent

# 2. Adding domain-specific words (like 'http' for web scraping)
stop_words.add('http')
stop_words.add('www')

text = "The quick brown fox jumps over the lazy dog. it is not fast."

# 1. Tokenize first
word_tokens = word_tokenize(text)

# 2. Filter out the stop words
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

print("Filtered Tokens:", filtered_sentence)
# Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.', 'not', 'fast', '.']
# Notice how "The", "over", "the", "it", "is" were removed, but "not" was kept!

Stemming

Stemming Techniques

Stemming is a text normalization technique that reduces words to a base or root form by chopping off suffixes (and sometimes prefixes) according to a fixed set of rules. It is a heuristic process that operates purely on string manipulation, so the resulting stem is not always a real word.

"running", "runs" → "run" (an irregular form such as "ran" is left as "ran")

Porter Stemmer

One of the oldest (1980) and most widely used suffix stripping algorithms. It uses 5 phases of word reduction.
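
To give a feel for what one of these phases looks like, here is a sketch of only the first sub-step of the original algorithm (step 1a, which handles plural endings). Everything else, including the later steps and their conditions on stem length, is omitted, so this is not a usable stemmer on its own. The names `STEP_1A_RULES` and `porter_step_1a` are chosen for this example.

```python
# Step 1a of the original Porter algorithm: plural endings.
# Rules are tried in order and only the first matching one is applied.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (unchanged)
    ("s", ""),       # cats     -> cat
]

def porter_step_1a(word):
    """Apply the first matching plural-suffix rule, if any."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats", "running"]:
    print(w, "->", porter_step_1a(w))
```

Note that "running" passes through this step untouched; suffixes like "-ing" and "-ed" are the job of later steps.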

"ponies" → "poni"
"caresses" → "caress"

Snowball Stemmer

The English Snowball stemmer is also known as Porter2. It is a revised version of the original Porter algorithm that fixes several of its inconsistencies, and the Snowball framework also provides stemmers for many other languages.

Lancaster Stemmer

The most aggressive stemming algorithm. It is very fast but often chops words down to unreadable levels.

"maximum" → "maxim"

Over-stemming
When two words with different meanings are stemmed to the same root.
"universal", "university", "universe" → "univers"

Under-stemming
When two words with the same base meaning are stemmed to different roots.
"alumnus", "alumni" → two different stems instead of one

Comparing NLTK Stemmers in Python

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

words = ["running", "generously", "history", "historical", "better", "universities"]

print(f"{'Word':<15} | {'Porter':<15} | {'Snowball':<15} | {'Lancaster':<15}")
print("-" * 65)

for w in words:
    p_stem = porter.stem(w)
    s_stem = snowball.stem(w)
    l_stem = lancaster.stem(w)
    print(f"{w:<15} | {p_stem:<15} | {s_stem:<15} | {l_stem:<15}")

# Output observation:
# 'better' is never mapped to 'good' (suffix stripping cannot handle irregular forms)
# 'universities' -> 'univers' (Porter/Snowball) -> 'univers' (Lancaster)
# 'historical' -> 'histor' (Porter/Snowball) -> 'hist' (Lancaster - highly aggressive)

Lemmatization

Lemmatization, unlike Stemming, reduces words to their valid dictionary base form, known as the lemma. While stemming blindly chops off letters, lemmatization uses vocabulary and morphological analysis of words, referring to a dictionary like WordNet.

Stemming vs Lemmatization

Original Word	Stemming	Lemmatization
Studying	studi	study
Better	better	good
Mice	mice	mouse
Was / Were	wa / were	be

Lemmatization is computationally more expensive but yields much higher quality, readable text.

The Importance of POS Context

To correctly lemmatize an irregular word, the algorithm must know its Part-Of-Speech (POS) context. For example, look at the word "saw":

"He saw a bird." (Verb) → lemma is see
"He cut it with a saw." (Noun) → lemma is saw

NLTK's lemmatizer defaults to Noun if you don't provide the POS tag!
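
The role of the POS tag can be made concrete with a toy lookup table keyed by word and tag. This is a hypothetical sketch: `LEMMA_TABLE` and `toy_lemmatize` are invented for illustration, and real lemmatizers combine a full dictionary such as WordNet with morphological rules.

```python
# Hypothetical lookup table keyed by (word, POS tag).
LEMMA_TABLE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("mice", "NOUN"): "mouse",
}

def toy_lemmatize(word, pos="NOUN"):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(toy_lemmatize("saw", pos="VERB"))  # verb reading
print(toy_lemmatize("saw"))              # default noun reading
print(toy_lemmatize("better"))           # no noun entry, so unchanged
print(toy_lemmatize("better", pos="ADJ"))
```

With the default noun tag, "better" comes back unchanged because the table has no noun entry for it, which mirrors what happens when a POS tag is omitted.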

NLTK vs spaCy Lemmatization

In practice, spaCy is often preferred for lemmatization because its pipeline assigns POS tags automatically before lemmatizing, so no tags have to be supplied by hand.

NLTK (Requires Manual POS)

import nltk
from nltk.stem import WordNetLemmatizer

# One-time download of the WordNet data used by the lemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Default POS is 'n' (Noun)
print(lemmatizer.lemmatize("better"))   
# Output: better (Incorrect!)

# Changing POS to 'a' (Adjective)
print(lemmatizer.lemmatize("better", pos="a")) 
# Output: good (Correct!)

# Changing POS to 'v' (Verb)
print(lemmatizer.lemmatize("running", pos="v")) 
# Output: run (Correct!)

spaCy (Automatic POS)

import spacy

# Load language model
nlp = spacy.load("en_core_web_sm")

text = "The mice were running better than before."
doc = nlp(text)

print(f"{'Word':<10} | {'Lemma':<10}")
print("-" * 25)
for token in doc:
    print(f"{token.text:<10} | {token.lemma_:<10}")

# Output (selected rows; exact lemmas can vary with the model version):
# mice       | mouse
# were       | be
# running    | run
# better     | well
