Stop Words Removal
Learn how to filter out common, uninformative words during text preprocessing using the NLTK stop words list.
Stop words are the most common words in a language and typically add little semantic meaning to a sentence. Words like "the", "is", "in", "and", and "a" are classic examples.
Common English Stop Words
i, me, my, myself, we, our, you, he, she, it, the, and, but, if, or, because, as, of, at, by
Why Remove Stop Words?
- Reduce Dataset Size: Stop words often account for 20-30% of the tokens in a text.
- Improve Performance: With fewer tokens, models train and run faster.
- Focus on Meaningful Words: Algorithms can concentrate on the words that carry the core semantic content (nouns, verbs, adjectives).
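To get a feel for the first point, the following sketch counts what fraction of a sample sentence consists of stop words. The stop list here is a small hand-picked subset for illustration, not NLTK's full list, and the exact percentage depends entirely on the text.

```python
# Rough illustration of how much of a text stop words can occupy.
# This stop list is a small hand-picked subset, not NLTK's full list.
STOP_WORDS = {"the", "is", "in", "and", "a", "it", "of", "to", "over"}

text = "The quick brown fox jumps over the lazy dog and it is a very lazy dog"
tokens = text.lower().split()

stop_count = sum(1 for t in tokens if t in STOP_WORDS)
print(f"{stop_count}/{len(tokens)} tokens are stop words "
      f"({stop_count / len(tokens):.0%})")
```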
The Danger of Stop Words Removal:
Sometimes stop words carry the ENTIRE meaning of a phrase. Shakespeare's "To be, or not to be" consists entirely of stop words; remove them and the phrase becomes empty. This is one reason modern deep learning models (Transformer-based architectures such as BERT) generally do not remove stop words: they learn contextual representations in which every token can matter.
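A quick sketch makes the problem concrete. The stop list below is a hand-written subset containing just the words in the quote, not NLTK's list:

```python
import string

# Hand-picked stop list covering the words in the quote (illustrative only)
STOP_WORDS = {"to", "be", "or", "not"}

phrase = "To be, or not to be"
tokens = [w.strip(string.punctuation).lower() for w in phrase.split()]

filtered = [t for t in tokens if t and t not in STOP_WORDS]
print(filtered)  # empty: every token was a stop word
```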
Implementation and Customization
Removing and Customizing Stop Words in Python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the required data (first run only)
nltk.download('stopwords')
nltk.download('punkt')  # tokenizer models ('punkt_tab' on recent NLTK versions)

# Load the English stop words list
stop_words = set(stopwords.words('english'))

# Customizing the stop words list
# 1. Removing a word (e.g. keeping 'not' for sentiment analysis)
stop_words.remove('not')

# 2. Adding domain-specific words (e.g. 'http' for scraped web text)
stop_words.add('http')
stop_words.add('www')
text = "The quick brown fox jumps over the lazy dog. it is not fast."
# 1. Tokenize first
word_tokens = word_tokenize(text)
# 2. Filter out the stop words
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
print("Filtered Tokens:", filtered_sentence)
# Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.', 'not', 'fast', '.']
# Notice how "The", "over", "the", "it", "is" were removed, but "not" was kept!
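The same filtering pattern works without NLTK if you want to avoid the download step. Here is a dependency-free sketch using a simple regex tokenizer and a small hand-written stop list (illustrative, not NLTK's list):

```python
import re

# Hand-picked subset of common English stop words (illustrative only)
STOP_WORDS = {"the", "is", "in", "and", "a", "it", "over"}
STOP_WORDS.discard("not")           # same customization as above: keep 'not'
STOP_WORDS.update({"http", "www"})  # add domain-specific words

def remove_stop_words(text):
    """Lowercase, tokenize with a regex, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("The quick brown fox jumps over the lazy dog. it is not fast."))
```

Note one design difference: the regex tokenizer silently drops punctuation, whereas NLTK's `word_tokenize` keeps `.` as a separate token, as the output above shows.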