Lecture: When does data need ‘liberating’?

Total suggested time: 15 minutes

One of the key differences between what we think of as “data” and other types of information is structure. For spreadsheets, that structure means rows and columns, but there are other types you’ll learn about later.

When information is unstructured, though, we’ll need to fix that if we want to perform any sort of meaningful analysis on it.

For our purposes, liberating data means imposing some kind of structure on the information.

That can sometimes be a lot of work, and how we actually impose that structure is up to us as data journalists.

It means making choices. We have to be mindful of how those choices can impact the final outcome of our analysis.
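To make the idea of imposing structure concrete, here is a minimal Python sketch. The notes, the field names (address, date, result) and the keep_unmatched option are all invented for illustration; they are assumptions, not part of any real data set. The sketch turns free-text lines into rows and columns, and shows one of the choices involved: what to do with information that doesn’t fit the structure.

```python
import csv
import io
import re

# Hypothetical free-text notes, invented for illustration only.
NOTES = [
    "Inspection at 12 Oak St on 2021-03-04: failed",
    "Inspection at 98 Elm Ave on 2021-03-09: passed",
    "Called the office, no answer",
]

# One possible structure we could impose: address, date, result.
PATTERN = re.compile(
    r"Inspection at (?P<address>.+?) on (?P<date>\d{4}-\d{2}-\d{2}): (?P<result>\w+)"
)


def to_rows(notes, keep_unmatched=False):
    """Turn free-text notes into a list of dict rows.

    keep_unmatched is the editorial choice: drop notes that do not fit
    the structure, or keep them as rows with empty fields and the raw text.
    """
    rows = []
    for note in notes:
        match = PATTERN.fullmatch(note)
        if match:
            rows.append({**match.groupdict(), "raw": note})
        elif keep_unmatched:
            rows.append({"address": None, "date": None, "result": None, "raw": note})
    return rows


def to_csv(rows):
    """Write the rows as CSV text, the rows-and-columns form a spreadsheet reads."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["address", "date", "result", "raw"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = to_rows(NOTES)
print(to_csv(rows))
```

Dropping the third note gives a cleaner table, while keeping it preserves a record of what the structure left out. Either option could be defensible, but each one can change what an analysis finds.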

There are three primary reasons why liberating data might be necessary.

1. The data you want does not exist.

Although our world is swimming in data, you’ll quickly find that some aspects of our society just aren’t tracked. Sometimes what’s not tracked is pretty important.

Mimi Ọnụọha, an artist and writer whose work specializes in interrogating the impacts of technological progress, defines “missing data sets” this way:

“The word ‘missing’ is inherently normative. It implies both a lack and an ought: something does not exist, but it should. That which should be somewhere is not in its expected place; an established system is disrupted by distinct absence. Just because some type of data doesn’t exist doesn’t mean it’s missing, and the idea of missing data sets is inextricably tied to a more expansive climate of inevitable and routine data collection.”

Consider, for example, the 2012 USA Today series Ghost Factories, which found the Environmental Protection Agency failed to test and track the impacts of shuttered lead smelters on neighborhoods across the country.

So the reporting team tested the sites on their own.

2. The data you want exists, but is encumbered.

What if we wanted to analyze the thousands of presidential appointees scattered across the U.S. government?

The federal government absolutely tracks that data – it publishes it in a book every four years, in fact. But prior to 2012, it was available only as PDF and poorly structured HTML.

To analyze this data, journalists had to literally free it from these clunky formats.
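As a rough illustration of what freeing data from HTML can involve, the Python sketch below pulls table cells out of markup into rows using only the standard library. The markup, agency names and job titles are invented placeholders, not the real publication, and the TableRows class is a hypothetical helper. Real pages are usually messier and need more handling than this.

```python
from html.parser import HTMLParser

# Hypothetical markup, invented for illustration; not the real publication.
HTML = """
<table>
  <tr><td>Agency A</td><td>Deputy Director</td></tr>
  <tr><td>Agency B</td><td>General Counsel</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


parser = TableRows()
parser.feed(HTML)
parser.close()
print(parser.rows)
```

Once the cells are in rows, they can be written to a spreadsheet-friendly format and analyzed like any other structured data.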

With the data in hand, New York Times reporter Annie Lowrey was able to reveal powerful insights into the nature of these political positions, including the gender imbalance in some cabinet departments.

3. The data you want exists, but is incomplete.

In the five-year period from 2018 to 2022, police officers in the U.S. shot and killed more than 5,000 people.

That’s not a figure the government publishes – it comes from The Washington Post’s database of police shootings.

In this case, the data is tracked, but it’s missing too much information for it to be reliable, as the Post explains:

“The FBI and the Centers for Disease Control and Prevention log fatal shootings by police, but officials acknowledge that their data is incomplete. Since 2015, The Post has documented more than twice as many fatal shootings by police as recorded by federal officials on average annually. That gap has widened in recent years, as the FBI in 2021 tracked only a third of departments’ fatal shootings.”

Tracking this data takes an enormous effort on the part of the reporting team, but it provides revelatory insights into a problem that continues to make headlines.

Retaining the existing structure of our world

Consider this note from the Post’s Fatal Force Database methodology:

“In this data set, The Post tracks only shootings… in which a police officer, in the line of duty, shoots and kills a civilian. The Post is not tracking deaths of people in police custody, fatal shootings by off-duty officers or non-shooting deaths in this data set.”

This decision limits the data. But this methodology is neither right nor wrong – it’s a choice. And one the team is making explicit in its reporting.

You’ll have to make similar choices for yourself as you wrestle with the liberation of data. It will often mean balancing two things:

The effort it takes to capture or clean the data.
The need to capture or clean enough data for a thorough analysis.
