Stop Words Removal
Learn how to filter out common, uninformative words during text preprocessing using the NLTK stop words list.
Stop words are the most common words in a language and typically add little semantic meaning to a sentence. Words like "the", "is", "in", "and", and "a" are classic examples.
Common English Stop Words
i, me, my, myself, we, our, you, he, she, it, the, and, but, if, or, because, as, of, at, by
Why Remove Stop Words?
- Reduce Dataset Size: Stop words often account for 20-30% of the tokens in a text.
- Improve Performance: With fewer tokens, models train and run faster.
- Focus on Meaningful Words: Algorithms can concentrate on the words that carry the core semantic content (nouns, verbs, adjectives).
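To get a feel for the first point, the following sketch counts what fraction of a sample sentence consists of stop words. The stop list here is a small hand-picked subset for illustration, not NLTK's full list, and the exact percentage depends entirely on the text.

```python
# Rough illustration of how much of a text stop words can occupy.
# This stop list is a small hand-picked subset, not NLTK's full list.
STOP_WORDS = {"the", "is", "in", "and", "a", "it", "of", "to", "over"}

text = "The quick brown fox jumps over the lazy dog and it is a very lazy dog"
tokens = text.lower().split()

stop_count = sum(1 for t in tokens if t in STOP_WORDS)
print(f"{stop_count}/{len(tokens)} tokens are stop words "
      f"({stop_count / len(tokens):.0%})")
```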
The Danger of Stop Words Removal:
Sometimes stop words carry the ENTIRE meaning of a phrase. Shakespeare's "To be, or not to be" consists entirely of stop words; remove them and the phrase becomes empty. This is one reason modern deep learning models (Transformer-based architectures such as BERT) generally do not remove stop words: they learn contextual representations in which every token can matter.
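A quick sketch makes the problem concrete. The stop list below is a hand-written subset containing just the words in the quote, not NLTK's list:

```python
import string

# Hand-picked stop list covering the words in the quote (illustrative only)
STOP_WORDS = {"to", "be", "or", "not"}

phrase = "To be, or not to be"
tokens = [w.strip(string.punctuation).lower() for w in phrase.split()]

filtered = [t for t in tokens if t and t not in STOP_WORDS]
print(filtered)  # empty: every token was a stop word
```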
Implementation and Customization
Removing and Customizing Stop Words in Python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the required data (first run only)
nltk.download('stopwords')
nltk.download('punkt')  # tokenizer models ('punkt_tab' on recent NLTK versions)

# Load the English stop words list
stop_words = set(stopwords.words('english'))

# Customizing the stop words list
# 1. Removing a word (e.g. keeping 'not' for sentiment analysis)
stop_words.remove('not')

# 2. Adding domain-specific words (e.g. 'http' for scraped web text)
stop_words.add('http')
stop_words.add('www')
text = "The quick brown fox jumps over the lazy dog. it is not fast."
# 1. Tokenize first
word_tokens = word_tokenize(text)
# 2. Filter out the stop words
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
print("Filtered Tokens:", filtered_sentence)
# Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.', 'not', 'fast', '.']
# Notice how "The", "over", "the", "it", "is" were removed, but "not" was kept!
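The same filtering pattern works without NLTK if you want to avoid the download step. Here is a dependency-free sketch using a simple regex tokenizer and a small hand-written stop list (illustrative, not NLTK's list):

```python
import re

# Hand-picked subset of common English stop words (illustrative only)
STOP_WORDS = {"the", "is", "in", "and", "a", "it", "over"}
STOP_WORDS.discard("not")           # same customization as above: keep 'not'
STOP_WORDS.update({"http", "www"})  # add domain-specific words

def remove_stop_words(text):
    """Lowercase, tokenize with a regex, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("The quick brown fox jumps over the lazy dog. it is not fast."))
```

Note one design difference: the regex tokenizer silently drops punctuation, whereas NLTK's `word_tokenize` keeps `.` as a separate token, as the output above shows.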