Data Mining Made Easy — A Primer

November 22, 2016

Reporter’s Toolbox: Data Mining Made Easy — A Primer

By Karen Hao

One sign of the journalism industry’s rapid transformation is the growing expectation, as more magazines and newspapers transition to digital-first publications, for journalists to add data storytelling tools to their arsenal.

If you’re new to data journalism, it may seem intimidating at first to learn data methods. But as a relative beginner in the field myself, I can assure you that with a little patience and diligence, picking up a few data tricks is relatively painless and straightforward.

A tale of two stories

Let’s start this primer with two examples of excellent data stories. Both case studies come by way of the Data Journalism 101 workshop at the Sacramento SEJ conference this year (you can listen to the audio recording here).

The first is InsideClimate News’ Lisa Song’s “Few Oil Pipeline Spills Detected by Much-Touted Technology.” As the first data story Song ever wrote, it had come about largely by accident.

While poking around various databases to practice looking for stories, she downloaded an available database that catalogued oil spills from the Department of Transportation’s Pipeline and Hazardous Materials Safety Administration. In the Excel spreadsheet, she found a column that noted the detection method of the oil spill. Through a quick calculation, she discovered that only 5% of the leaks had been discovered by the highly-touted leak detection sensors installed by the oil companies themselves. Once she verified these results directly with the agency, this became the lede for her story.

It’s a great example of how good data stories don’t need to use fancy computational methods. Song relied only on basic Excel and SQL, a standard language for accessing data within databases. At the Sacramento workshop, she also made an important point that the bulk of her work on the story was still the reporting; the data analysis took a small fraction of the time.

The second example is Seattle Times’ coverage of the Oso landslide, and shows a broader and more complex use of data methods.

In March of 2014, a massive mudslide decimated a neighborhood situated directly in its path, killing 43 people. At the time, Stanford journalism professor Cheryl Phillips — who presented the story series at the workshop — was the data innovation editor at the paper, and her team worked to investigate the destruction.

Prior to the Oso disaster, the team’s work with the University of Washington had uncovered a connection between the region’s logging practices and high frequency of landslides. By overlaying all of the region’s logging permits on a map of all of the slopes’ steepness and stability conditions, they had proved that the state had granted all permits without regard to safety protocols. Using this knowledge, they re-applied the same analysis and mapping techniques to investigate the permitting processes that preceded Oso. They concluded that the neighborhood should have never been built and the disaster had once again been a result of state negligence.

Alongside the narrative reporting of the incident, Phillips’ team created an interactive map of the victims that showed their faces and bios at the locations of their deaths, to further drive home the tragedy of government inaction.

This example demonstrates the ways in which data methods can be supplementary throughout all the stages of reporting — from uncovering the lede to presenting the story.

With these two case studies in mind, we can now cover some of the tools and resources for data reporting, organized under the four steps of data storytelling.

Step 1: Data Collection

Data Sources: The first step to writing any data story is finding and compiling the data itself. If you’re lucky, there’s a readily available database already compiled in your topic of interest. Below are a smattering of interesting data sources that I’ve found over time:

any of the datasets on the U.S. Energy Information Administration’s website
the Statistics Division of the Food and Agriculture Organization of the United Nations

Jeremy Singer-Vine’s Data Is Plural newsletter
Data.gov, home of the federal government’s open databases
Data USA, the brainchild of the MIT Media Lab MacroConnections group and the most comprehensive website and visualization engine of public U.S. government data

Web Scrapers: If, instead, you need to compile the data yourself, it’ll be good to familiarize yourself with one or two web scraping tools. Web scrapers allow you to automate the extraction of data from a website. An example of this could be downloading all of the pdf documents spread out across dozens of pages in a website. Although you could perform the task manually, you can save yourself a good chunk of time by automating it.

For beginners, Import.io and the Chrome extension Scraper are powerful, easy-to-use tools that can pull data from web pages into spreadsheets.
For learners with coding chops, take a look at Javascripts’ jQuery and Python’s Beautiful Soup libraries.

Additional reading: “Scraping the Web” by Noah Veltman

PDF Processing Tools: And finally, if you’re so unfortunate as to have the data you need locked up in pdfs, there are also pdf processing tools.

CometDocs can convert entire pdf documents into Word or Excel files, whereas
Tabula can convert tabular data in specific locations into manipulable spreadsheets.
Additional reading: “Busting data out of PDFs” from the Stanford Computational Journalism Lab

Step 2: Data Cleansing

Often times, the data you collect will have imperfections. Whether they are duplicate entries, misspellings or formatting issues, they will affect your downstream analyses if they aren’t smoothed or weeded out first.

Excel/Google Sheets can perform reorganization tasks, like splitting larger chunks of text into multiple columns, that make it easier to parse. Additional reading: “Excel: Extract text with FIND and MID” by Anthony DeBarros
OpenRefine helps comb through large datasets by quickly teasing out anomalies, entry variations and other subtle abnormalities. Additional reading: “Using Google Refine to Clean Messy Data” by Dan Nguyen

As a cautionary note, sometimes the issues within a dataset are not actually resolvable or worth the time to tackle. Quartz has an excellent guide for the most common pitfalls to avoid when confronted with poor data quality. Additional reading: “The Quartz guide to bad data” from Quartz

Step 3: Data Exploration & Analysis

OK, your data is squeaky clean, and it’s time to figure out what’s actually in your dataset. During the SEJ’s Sacramento workshop, Phillips mentioned that she often begins her analysis by visualizing her data to understand its patterns and trends. In one project she recounted how, by mapping out a spike in methadone overdose fatalities, her team discovered its correlation to low-income communities, effectively changing the lede of their story.

Depending on the nature of your dataset, exploration and analysis can take on many forms.

For quantitative data:

WTFcsv from DataBasic automatically generates visualizations from csv files to help you skim potential correlations.
Tableau Public is an excellent and powerful tool for creating beautiful visualizations.
And of course, the chart functionality on Excel and Google Sheets shouldn’t be overlooked.

For qualitative data:

WordCounter (also from DataBasic) generates word clouds from chunks of text.
Overview helps search, visualize and review hundreds of thousands of documents in any format.

For geographical data:

Tableau Public comes in handy again for prototyping interactive maps.
Esri’s ArcGIS Online (sadly not free) generates more sophisticated, higher-quality maps.

Step 4: Data Storytelling

Finally, you’re ready to tell your story. Perhaps, like Song, the analysis has given you a lede to pursue and the rest of your work will involve traditional reporting techniques; or like Phillips, the story begs to be told through visualization. If the latter, the visualizations you create during your exploration could very well be the same visualizations to present to your readers. By the way, both Song and Phillips mentioned the important role designers play during this process.

Since the tools for visual storytelling are largely the same as the tools above, here is a list of inspiring examples instead:

map: Yale Climate Change Communications’ Climate Opinion Maps
graphic: RJ Andrew’s “Endangered Safari”
bar chart: FiveThirtyEight’s “Hip-hop Is Turning on Donald Trump”
line graph: Bloomberg’s “What’s Really Warming the World?”
timeline: xkcd’s Earth Temperature Timeline
documentary: Neil Halloran’s “The Fallen of World War II”
interactive essay: FiveThirtyEight’s “Gun Deaths in America”

And for good measure, here’s a great article about the number one pitfall to avoid:

Additional reading: “Disinformation Visualization: How to lie with datavis” by Mushon Zer-Aviv

Additional resources

To continue your data journalism education, below are more excellent resources. First, the free:

Stanford Computational Journalism Lab posts a lot of its course content.
Tableau’s Resources include how-to videos, sample data sets and live training sessions.
Codecademy teaches introductory classes for Python, JavaScript, HTML/CSS and more.
Hacks/Hackers is a community of journalists and technologists with chapters that organize local meetups around the world.
IRE’s NICAR mailing list is a great way to listen in on techno-journo discussions and learn by exposure.
UC Berkeley’s Advanced Media Institute hosts online webinars.
Northwestern’s Knight Lab works on a breadth of projects that are interesting to learn from.

And the not-so-free:

IRE’s data training boot camps vary between three and six days, and offer financial assistance to minority journalists.
Boston University and Berkeley’s Advanced Media Institute both conduct a variety of workshops that range in length and price.

Special thanks goes to Lisa Song, Cheryl Phillips, the Stanford Computational Journalism Lab, MIT Professor Rahul Bhargava’s Data Storytelling Studio and NICAR — the unknowing sources that proffered me with the majority of this material.

Karen Hao received her B.S. in mechanical engineering from MIT and worked as an application engineer at a startup before transitioning into journalism. She is currently an intern at SIERRA magazine in San Francisco and will begin a fellowship with Mother Jones in December.

* From the weekly news magazine SEJournal Online, Vol. 1, No. 4. Content from each new issue of SEJournal Online is available to the public via the SEJournal Online main page. Subscribe to the e-newsletter here. And see past issues of the SEJournal archived here.

SEJ Publication Types:

Topics on the Beat:

Visibility:

Data Mining Made Easy — A Primer

Reporter’s Toolbox: Data Mining Made Easy — A Primer

Search Publications

Browse Publications

Subscriptions