Text Cleaning Tutorial Section

Text Cleaning

Learn essential text cleaning techniques for NLP including noise removal, handling HTML tags, and filtering special characters.

Text Cleaning and Preprocessing

Before any sophisticated NLP algorithm can be applied, raw text must be cleaned. Text cleaning is often the most time-consuming part of an NLP project, as real-world data is inherently messy.

Rule of Thumb: Garbage In, Garbage Out. The quality of your text cleaning pipeline effectively caps how well your model can perform, because no model can recover information that cleaning destroyed or see past noise that cleaning left in.

Common Text Cleaning Techniques

1. Lowercasing

Converting all characters to lowercase ensures that "Apple" and "apple" are treated as the same word, reducing the total vocabulary size.

"I LOVE Python!" → "i love python!"

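As a minimal sketch of the effect on vocabulary size, the snippet below counts distinct whitespace-separated tokens before and after lowercasing. The sample sentence and the helper name `vocabulary` are illustrative assumptions, not part of any library.

```python
def vocabulary(text):
    """Return the set of distinct whitespace-separated tokens in text."""
    return set(text.split())

sample = "Apple apple APPLE pie Pie"  # made-up example sentence

raw_vocab = vocabulary(sample)
lower_vocab = vocabulary(sample.lower())

print(len(raw_vocab))    # 5 distinct tokens before lowercasing
print(len(lower_vocab))  # 2 distinct tokens after lowercasing
```

For text that contains non-English characters, str.casefold() is a more aggressive alternative to str.lower() (for example, it maps the German "ß" to "ss").
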
2. Removing HTML or Markup

When scraping data from the web, strip HTML tags with a parser such as BeautifulSoup or, for simple cases, a regular expression. Otherwise markup such as "p" or "div" ends up being counted as words.

"<p>Hello World</p>" → "Hello World"

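One way to do this without third-party dependencies is sketched below using Python's standard html.parser module; BeautifulSoup's get_text() plays a similar role. The names `_TextExtractor` and `strip_html` are hypothetical helpers introduced for illustration.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(markup):
    """Return the text content of markup with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join text nodes with spaces so adjacent elements do not merge
    return " ".join(" ".join(parser.parts).split())

print(strip_html("<p>Hello World</p>"))                 # Hello World
print(strip_html("<p>Fish &amp; chips</p><p>Yum</p>"))  # Fish & chips Yum
```

Note that this sketch also keeps the contents of script and style elements, so scraped pages may need those elements dropped separately.
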
3. Removing Punctuation vs Keeping Punctuation

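For simple word-frequency tasks (such as a basic bag-of-words spam classifier), punctuation mostly adds noise and is often removed. However, punctuation can change the meaning of a sentence dramatically, so tasks that depend on meaning usually keep it: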

"Let's eat, Grandma!" (Dinner time)
"Let's eat Grandma!" (Cannibalism)

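The trade-off can be sketched with two small tokenizers: one that discards punctuation and one that keeps each mark as its own token. The function names and regular expressions below are illustrative assumptions, not a standard API.

```python
import re

def tokens_without_punctuation(text):
    """Lowercase and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def tokens_with_punctuation(text):
    """Lowercase and keep punctuation marks as separate tokens."""
    return re.findall(r"[a-z']+|[^\w\s]", text.lower())

a = "Let's eat, Grandma!"
b = "Let's eat Grandma!"

# Without punctuation the two sentences become indistinguishable
print(tokens_without_punctuation(a) == tokens_without_punctuation(b))  # True

# With punctuation kept, the comma still separates the two readings
print(tokens_with_punctuation(a))  # ["let's", 'eat', ',', 'grandma', '!']
print(tokens_with_punctuation(b))  # ["let's", 'eat', 'grandma', '!']
```
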
4. Expanding Contractions

Expanding standard contractions normalizes the text, so that "they're" and "they are" produce the same tokens instead of being counted as different words.

"They're going." → "They are going."

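A minimal sketch of dictionary-based expansion follows. The three-entry table, the name `expand_contractions`, and the capitalization handling are assumptions for illustration; a real project would need a far larger table.

```python
import re

# Tiny illustrative table; real text needs many more entries.
CONTRACTIONS = {
    "they're": "they are",
    "don't": "do not",
    "isn't": "is not",
}

# One alternation pattern with word boundaries, matched case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Expand known contractions, preserving a leading capital letter."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes

    def _replace(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)

print(expand_contractions("They're going."))        # They are going.
print(expand_contractions("It isn\u2019t ready."))  # It is not ready.
```

Ambiguous forms such as "'s" (which can mean "is", "has", or a possessive) are deliberately left out of the table, since a plain lookup cannot tell them apart.
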
Python Implementation with Regex

import re

def clean_text(text):
    # 1. Convert to lowercase and normalize curly apostrophes
    text = text.lower().replace("\u2019", "'")

    # 2. Expand common contractions (simplified; whole words only)
    contractions = {"don't": "do not", "isn't": "is not", "you're": "you are"}
    for word, replacement in contractions.items():
        text = re.sub(r'\b' + re.escape(word) + r'\b', replacement, text)

    # 3. Remove HTML tags (replace with a space so adjacent words don't merge)
    text = re.sub(r'<[^>]+>', ' ', text)

    # 4. Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)

    # 5. Remove punctuation, digits & special characters
    #    (text is already lowercase, so a-z is enough)
    text = re.sub(r'[^a-z\s]', '', text)

    # 6. Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

raw = "<h1>Wow! NLP is so COOL!!! don't click https://example.com 123</h1>"
print(clean_text(raw))
# Output: wow nlp is so cool do not click
