Text Cleaning
Learn essential text cleaning techniques for NLP including noise removal, handling HTML tags, and filtering special characters.
Text Cleaning and Preprocessing
Before any sophisticated NLP algorithm can be applied, raw text must be cleaned. Text cleaning is often the most time-consuming part of an NLP project, as real-world data is inherently messy.
Common Text Cleaning Techniques
1. Lowercasing
Converting all characters to lowercase ensures that "Apple" and "apple" are treated as the same word, reducing the total vocabulary size.
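The effect on vocabulary size can be sketched in a few lines. This is only an illustration: it assumes naive whitespace tokenization and a made-up sample string, and `vocabulary` is a hypothetical helper rather than a library function.

```python
def vocabulary(text):
    """Return the set of distinct whitespace-separated tokens."""
    return set(text.split())

sample = "Apple apple APPLE pie Pie"  # made-up sample text

raw_vocab = vocabulary(sample)
lower_vocab = vocabulary(sample.lower())

print(len(raw_vocab))    # 5 distinct tokens before lowercasing
print(len(lower_vocab))  # 2 distinct tokens after lowercasing
```

For text beyond ASCII, `str.casefold()` is a more aggressive alternative to `str.lower()`.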
"I LOVE Python!" → "i love python!"2. Removing HTML or Markup
When scraping data from the web, strip HTML tags with a parser such as BeautifulSoup or with regular expressions, so that markup is not treated as words.
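As a rough, dependency-free sketch of parser-based tag removal, the example below swaps in Python's standard-library `html.parser` in place of BeautifulSoup. `TagStripper` and `strip_tags` are hypothetical names chosen for illustration, and the sketch keeps the contents of `<script>` and `<style>` elements, which a real pipeline would probably want to drop.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are skipped.
        self.parts.append(data)

def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    # Join with spaces so adjacent elements do not fuse into one word,
    # then collapse any repeated whitespace.
    return " ".join(" ".join(parser.parts).split())

print(strip_tags("<p>Hello World</p>"))        # Hello World
print(strip_tags("<p>Hello</p><p>World</p>"))  # Hello World
```

Joining text nodes with a space is a design choice: it keeps neighbouring paragraphs apart, at the cost of splitting a word that contains inline markup.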
"<p>Hello World</p>" → "Hello World"3. Removing Punctuation vs Keeping Punctuation
For simple frequency-based tasks (such as spam detection), punctuation mostly adds noise. However, punctuation can change the meaning of a sentence dramatically:
"Let's eat, Grandma!" (Invitation)
"Let's eat Grandma!" (Cannibalism)
4. Expanding Contractions
Expanding standard contractions normalizes the text, so that "they're" and "they are" map to the same tokens.
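One possible way to do this is a dictionary lookup driven by a single regular expression. The sketch below is illustrative only: the `CONTRACTIONS` mapping is deliberately tiny, `expand_contractions` is a hypothetical helper, and it assumes straight apostrophes (curly ones such as ’ are not handled).

```python
import re

# Illustrative, deliberately incomplete mapping (an assumption of this sketch).
CONTRACTIONS = {
    "they're": "they are",
    "don't": "do not",
    "isn't": "is not",
    "you're": "you are",
}

# One alternation, anchored on word boundaries, matched case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    def replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Preserve a leading capital, e.g. "They're" -> "They are".
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _PATTERN.sub(replace, text)

print(expand_contractions("They're going."))  # They are going.
print(expand_contractions("I don't know."))   # I do not know.
```

Matching on word boundaries avoids rewriting a contraction that happens to appear inside a longer token, and the case-insensitive lookup means the text does not have to be lowercased first.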
"They're going." → "They are going."import re
def clean_text(text):
# 1. Convert to lowercase
text = text.lower()
# 2. Expand common contractions (simplified)
contractions = {"don't": "do not", "isn't": "is not", "you're": "you are"}
for word, replacement in contractions.items():
text = text.replace(word, replacement)
# 3. Remove HTML tags
text = re.sub(r'<.*?>', '', text)
# 4. Remove URLs
text = re.sub(r'http\S+|www\S+|https\S+', '', text)
# 5. Remove punctuation & special characters
text = re.sub(r'[^a-zA-Z\s]', '', text)
# 6. Remove extra whitespace
text = re.sub(r'\s+', ' ', text).strip()
return text
raw = "<h1>Wow! NLP is so COOL!!! don't click https://example.com 123</h1>"
print(clean_text(raw))
# Output: "wow nlp is so cool do not click"