Text summarization is the task of creating a short, accurate, and fluent summary of an article.
A popular and free dataset for use in text summarization experiments with deep learning methods is the CNN News story dataset.
In this tutorial, you will discover how to prepare the CNN News Dataset for text summarization.
After completing this tutorial, you will know:
- About the CNN News dataset and how to download the story data to your workstation.
- How to load the dataset and split each article into story text and highlights.
- How to clean the dataset ready for modeling and save the cleaned data to file for later use.
Let’s get started.

Tutorial Overview
This tutorial is divided into 5 parts; they are:
- CNN News Story Dataset
- Inspect the Dataset
- Load Data
- Data Cleaning
- Save Clean Data
CNN News Story Dataset
The DeepMind Q&A Dataset is a large collection of news articles from CNN and the Daily Mail with associated questions.
The dataset was developed as a question and answering task for deep learning and was presented in the 2015 paper “Teaching Machines to Read and Comprehend.”
This dataset has also been used for text summarization, where the news articles are summarized into their highlight sentences. Notable examples are the papers “Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond” and “Get To The Point: Summarization with Pointer-Generator Networks.”
Kyunghyun Cho, an academic at New York University, has made the dataset available for download from his DeepMind Q&A Dataset page.
In this tutorial, we will work with the CNN dataset, specifically the ASCII text download of the news stories.
This dataset contains more than 92,000 news articles, where each article is stored in a single “.story” file.
Download this dataset to your workstation. Once downloaded, you can unzip the archive on the command line as follows:
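For example, assuming the downloaded archive is named cnn_stories.tgz (the exact filename may differ depending on where it was downloaded from):

tar xvf cnn_stories.tgz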
This will create a cnn/stories/ directory filled with .story files.
For example, we can count the number of story files on the command line as follows:
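One way to do this with standard Unix tools (any equivalent listing command works):

ls -ltr cnn/stories/ | wc -l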
This shows us that we have a total of 92,580 stories.
Inspect the Dataset
Using a text editor, review some of the stories and note down some ideas for preparing this data.
Below is an example of a story, with the body truncated for brevity.
(CNN) — If you travel by plane and arriving on time makes a difference, try to book on Hawaiian Airlines. In 2012, passengers got where they needed to go without delay on the carrier more than nine times out of 10, according to a study released on Monday.
In fact, Hawaiian got even better from 2011, when it had a 92.8% on-time performance. Last year, it improved to 93.4%.
[…]
@highlight
Hawaiian Airlines again lands at No. 1 in on-time performance
@highlight
The Airline Quality Rankings Report looks at the 14 largest U.S. airlines
@highlight
ExpressJet and American Airlines had the worst on-time performance
@highlight
Virgin America had the best baggage handling; Southwest had lowest complaint rate
I note that the general structure of the dataset is to have the story text followed by a number of “highlight” points.
Reviewing articles on the CNN website, I can see that this pattern is still common.

Example of a CNN News Article With Highlights from cnn.com
The ASCII text does not include the article titles, but we can use these human-written “highlights” as multiple reference summaries for each news article.
I can also see that many articles start with source information, presumably the CNN office that produced the story; for example:
(CNN) —
Gaza City (CNN) —
Los Angeles (CNN) —
These can be removed completely.
Data cleaning is a challenging problem and must be tailored for the specific application of the system.
If we are generally interested in developing a news article summarization system, then we may clean the text in order to simplify the learning problem by reducing the size of the vocabulary.
Some data cleaning ideas for this data include:
- Normalize case to lowercase (e.g. “An Italian”).
- Remove punctuation (e.g. “on-time”).
We could also further reduce the vocabulary to speed up testing models, such as (see the sketch after this list):
- Remove numbers (e.g. “93.4%”).
- Remove low-frequency words like names (e.g. “Tom Watkins”).
- Truncate stories to the first 5 or 10 sentences.
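As a rough sketch of two of these ideas (removing numbers and truncating stories), the helper below is hypothetical and not part of the tutorial code; it naively splits sentences on the full stop character, which is good enough for quick experiments:

import re

# hypothetical helper: keep only the first n sentences and drop tokens containing digits
def reduce_story(story_text, n_sentences=10):
    # naive sentence split on '. '; a proper sentence tokenizer could be used instead
    sentences = story_text.split('. ')[:n_sentences]
    text = '. '.join(sentences)
    # remove tokens that contain digits, e.g. '93.4%'
    tokens = [w for w in text.split() if not re.search(r'\d', w)]
    return ' '.join(tokens)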
Load Data
The first step is to load the data.
We can start by writing a function to load a single document given a filename. The data has some unicode characters, so we will load the dataset by forcing the encoding to be UTF-8.
The function below named load_doc() will load a single document as text given a filename.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
Next, we need to step over each filename in the stories directory and load them.
We can use the listdir() function to load all filenames in the directory, then load each one in turn. The function below named load_stories() implements this behavior and provides a starting point for preparing the loaded documents.
# load all stories in a directory
def load_stories(directory):
    for name in listdir(directory):
        filename = directory + '/' + name
        # load document
        doc = load_doc(filename)
Each document can be separated into the news story text and the highlights or summary text.
The split for these two points is the first occurrence of the ‘@highlight‘ token. Once split, we can organize the highlights into a list.
The function below named split_story() implements this behavior and splits a given loaded document text into a story and list of highlights.
# split a document into news story and highlights
def split_story(doc):
    # find first highlight
    index = doc.find('@highlight')
    # split into story and highlights
    story, highlights = doc[:index], doc[index:].split('@highlight')
    # strip extra white space around each highlight
    highlights = [h.strip() for h in highlights if len(h) > 0]
    return story, highlights
We can now update the load_stories() function to call the split_story() function for each loaded document and then store the results in a list.
# load all stories in a directory
def load_stories(directory):
    all_stories = list()
    for name in listdir(directory):
        filename = directory + '/' + name
        # load document
        doc = load_doc(filename)
        # split into story and highlights
        story, highlights = split_story(doc)
        # store
        all_stories.append({'story':story, 'highlights':highlights})
    return all_stories
Tying all of this together, the complete example of loading the entire dataset is listed below.
from os import listdir

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# split a document into news story and highlights
def split_story(doc):
    # find first highlight
    index = doc.find('@highlight')
    # split into story and highlights
    story, highlights = doc[:index], doc[index:].split('@highlight')
    # strip extra white space around each highlight
    highlights = [h.strip() for h in highlights if len(h) > 0]
    return story, highlights

# load all stories in a directory
def load_stories(directory):
    stories = list()
    for name in listdir(directory):
        filename = directory + '/' + name
        # load document
        doc = load_doc(filename)
        # split into story and highlights
        story, highlights = split_story(doc)
        # store
        stories.append({'story':story, 'highlights':highlights})
    return stories

# load stories
directory = 'cnn/stories/'
stories = load_stories(directory)
print('Loaded Stories %d' % len(stories))
Running the example prints the number of loaded stories.
We can now access the loaded story and highlight data, for example:
print(stories[4]['story'])
print(stories[4]['highlights'])
Data Cleaning
Now that we can load the story data, we can pre-process the text by cleaning it.
We can process the stories line by line and use the same cleaning operations on each highlight line.
For a given line, we will perform the following operations:
Remove the CNN office information.
# strip source cnn office if it exists
index = line.find('(CNN) -- ')
if index > -1:
    line = line[index+len('(CNN)'):]
Split the line into tokens on white space:
# tokenize on white space
line = line.split()
Normalize the case to lowercase.
# convert to lower case
line = [word.lower() for word in line]
Remove all punctuation characters from each token (Python 3 specific).
# prepare a translation table to remove punctuation
table = str.maketrans('', '', string.punctuation)
# remove punctuation from each token
line = [w.translate(table) for w in line]
Remove any words that have non-alphabetic characters.
# remove tokens with non-alphabetic characters
line = [word for word in line if word.isalpha()]
Putting this all together, below is a new function named clean_lines() that takes a list of lines of text and returns a list of clean lines of text.
# clean a list of lines
def clean_lines(lines):
    cleaned = list()
    # prepare a translation table to remove punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # strip source cnn office if it exists
        index = line.find('(CNN) -- ')
        if index > -1:
            line = line[index+len('(CNN)'):]
        # tokenize on white space
        line = line.split()
        # convert to lower case
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [w.translate(table) for w in line]
        # remove tokens with non-alphabetic characters
        line = [word for word in line if word.isalpha()]
        # store as string
        cleaned.append(' '.join(line))
    # remove empty strings
    cleaned = [c for c in cleaned if len(c) > 0]
    return cleaned
We can call this function for a story by first splitting the story into lines of text. The function can be called directly on the list of highlights.
example['story'] = clean_lines(example['story'].split('\n'))
example['highlights'] = clean_lines(example['highlights'])
The complete example of loading and cleaning the dataset is listed below.
from os import listdir
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# split a document into news story and highlights
def split_story(doc):
    # find first highlight
    index = doc.find('@highlight')
    # split into story and highlights
    story, highlights = doc[:index], doc[index:].split('@highlight')
    # strip extra white space around each highlight
    highlights = [h.strip() for h in highlights if len(h) > 0]
    return story, highlights

# load all stories in a directory
def load_stories(directory):
    stories = list()
    for name in listdir(directory):
        filename = directory + '/' + name
        # load document
        doc = load_doc(filename)
        # split into story and highlights
        story, highlights = split_story(doc)
        # store
        stories.append({'story':story, 'highlights':highlights})
    return stories

# clean a list of lines
def clean_lines(lines):
    cleaned = list()
    # prepare a translation table to remove punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # strip source cnn office if it exists
        index = line.find('(CNN) -- ')
        if index > -1:
            line = line[index+len('(CNN)'):]
        # tokenize on white space
        line = line.split()
        # convert to lower case
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [w.translate(table) for w in line]
        # remove tokens with non-alphabetic characters
        line = [word for word in line if word.isalpha()]
        # store as string
        cleaned.append(' '.join(line))
    # remove empty strings
    cleaned = [c for c in cleaned if len(c) > 0]
    return cleaned

# load stories
directory = 'cnn/stories/'
stories = load_stories(directory)
print('Loaded Stories %d' % len(stories))

# clean stories
for example in stories:
    example['story'] = clean_lines(example['story'].split('\n'))
    example['highlights'] = clean_lines(example['highlights'])
Note that the story is now stored as a list of clean lines, nominally separated into sentences.
Save Clean Data
Finally, now that the data has been cleaned, we can save it to file.
An easy way to save the cleaned data is to Pickle the list of stories and highlights.
For example:
# save to file
from pickle import dump
dump(stories, open('cnn_dataset.pkl', 'wb'))
This will create a new file named cnn_dataset.pkl with all of the cleaned data. This file will be about 374 Megabytes in size.
We can then load it later and use it with a text summarization model as follows:
# load from file
from pickle import load
stories = load(open('cnn_dataset.pkl', 'rb'))
print('Loaded Stories %d' % len(stories))
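To illustrate how the loaded data might then feed a model, below is a small sketch that is not part of the original tutorial: it pairs each cleaned story with its highlights joined into a single reference summary. The pairing scheme and variable names are assumptions only.

from pickle import load

# load the cleaned dataset
stories = load(open('cnn_dataset.pkl', 'rb'))

# illustrative only: build (article, summary) pairs for a summarization model
pairs = list()
for example in stories:
    article = ' '.join(example['story'])
    summary = ' . '.join(example['highlights'])
    pairs.append((article, summary))
print('Prepared %d article-summary pairs' % len(pairs))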
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Summary
In this tutorial, you discovered how to prepare the CNN News Dataset for text summarization.
Specifically, you learned:
- About the CNN News dataset and how to download the story data to your workstation.
- How to load the dataset and split each article into story text and highlights.
- How to clean the dataset ready for modeling and save the cleaned data to file for later use.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.