Text Summarization
Master extractive and abstractive summarization using NLTK and Transformers.
The Art of Compression
Text summarization is the task of shortening a document while retaining its most important points. There are two primary types:
Extractive
Original sentences are ranked by importance, and the top ones are selected verbatim. (Think: Highlighting key parts of a book)
Abstractive
The model generates new sentences that convey the source's meaning instead of copying sentences from it. (Think: Paraphrasing in its own words)
Level 1 — Extractive Summarization (NLTK)
Simple extractive summarization works by calculating word frequencies and scoring sentences based on those frequencies.
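import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time downloads of the tokenizer models and the stopword list.
# Newer NLTK releases look for "punkt_tab"; older ones use "punkt".
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")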
text = """Artificial intelligence is a branch of computer science that deals with creating smart machines.
It has become a central part of the technology industry.
The research associated with artificial intelligence is highly technical and specialized.
The core problems of artificial intelligence include programming computers for certain traits such as knowledge and reasoning."""
# 1. Word frequency table (stopwords and punctuation excluded)
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)
freqTable = dict()
for word in words:
    word = word.lower()
    if word.isalnum() and word not in stopWords:
        freqTable[word] = freqTable.get(word, 0) + 1
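# 2. Score sentences: add the frequency of each table word that occurs in the sentence
sentences = sent_tokenize(text)
sentenceValue = dict()
for sentence in sentences:
    # Compare whole tokens, not substrings, so "art" does not match "artificial"
    sentenceWords = set(word_tokenize(sentence.lower()))
    for word, freq in freqTable.items():
        if word in sentenceWords:
            sentenceValue[sentence] = sentenceValue.get(sentence, 0) + freq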
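# 3. Get the two highest-scoring sentences, then restore their original order
top = sorted(sentenceValue, key=sentenceValue.get, reverse=True)[:2]
summary = [sentence for sentence in sentences if sentence in top]
print("Summary:\n" + " ".join(summary))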
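The frequency-scoring steps above can also be wrapped in a reusable function. The following is a minimal sketch under stated assumptions, not the NLTK version: it uses regular expressions instead of NLTK tokenizers, a tiny hard-coded stopword list (STOPWORDS), and hypothetical helper names (split_sentences, content_words, summarize). It also differs from the scoring above in one respect: each sentence score is divided by the number of content words, so long sentences are not favored merely for their length.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not NLTK's full English list.
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "has", "in", "is",
             "it", "of", "on", "such", "that", "the", "to", "with"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, n=2):
    """Return the n highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))
    scores = []
    for sentence in sentences:
        words = content_words(sentence)
        # Average word frequency, so the score does not grow with sentence length.
        scores.append(sum(freq[w] for w in words) / len(words) if words else 0.0)
    # Pick the indices of the top-n scores, then sort them back into document order.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    sample = "Cats chase mice. Dogs chase cats and cats. Birds sing."
    print(" ".join(summarize(sample, n=2)))
```

Returning the selected sentences in document order usually reads more naturally than returning them in score order, because the summary follows the flow of the source text.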
Level 2 — Abstractive Summarization (Hugging Face)
Modern abstractive summarization uses pretrained sequence-to-sequence Transformer models such as BART or T5, which generate a fluent rephrasing of the source content instead of copying sentences from it.
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
ARTICLE = """The Apollo 11 mission was the first manned mission to land on the Moon.
Launched by NASA on July 16, 1969, it carried Commander Neil Armstrong and lunar module pilot Buzz Aldrin.
Armstrong became the first person to step onto the lunar surface on July 21, 1969.
The event was broadcast on live TV to a worldwide audience and is considered a milestone in human history."""
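# max_length and min_length are counted in tokens, not words; do_sample=False makes decoding deterministic
summary = summarizer(ARTICLE, max_length=50, min_length=20, do_sample=False)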
print(summary[0]['summary_text'])
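BART accepts a limited input length (1,024 tokens), so a document longer than that is commonly split into chunks that are summarized one at a time. The following is a minimal sketch under stated assumptions: chunk_sentences is a hypothetical helper, it counts whitespace-separated words as a rough stand-in for model tokens, and the default budget of 400 words is an arbitrary illustrative value that leaves headroom below the model limit.

```python
import re


def chunk_sentences(text, max_words=400):
    """Group whole sentences into chunks of at most max_words words.

    Assumption: word count only approximates the model's token count.
    A single sentence longer than max_words becomes its own chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Possible use with the pipeline above (left commented out because it needs the model):
# partial = [summarizer(chunk, max_length=50, min_length=20, do_sample=False)[0]["summary_text"]
#            for chunk in chunk_sentences(long_text)]
# print(" ".join(partial))
```

Splitting on sentence boundaries keeps each chunk readable on its own, at the cost of chunks that vary in size.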