The US plans to stop releasing its most detailed census data

Hands off.
Image: AP Photo/Jason E. Miczek

The US Census Bureau is concerned about privacy. Maybe a little too concerned.

As a data-focused journalist who writes about economic and demographic trends, I use census data a lot. Specifically, I rely on the individual-level microdata that is released by the bureau and turned into an easily usable format by the Minnesota Population Center. I am among tens of thousands of data analysts who rely on this data to study American poverty, health, and population patterns. The US Census Bureau announced this week that, because of privacy concerns, this microdata will no longer be made widely available.

Microdata are data that have not yet been summarized into tables and are available for download in their raw form. In the case of the census, the microdata feature detailed information about specific individuals, but not their names or addresses. Working with this data means that analysts don’t have to rely on the aggregate statistics the agency calculates. Instead, they can calculate anything they want by constructing their own populations from individual-level responses. Microdata are available for the decennial census going back to 1850 and for the annual American Community Survey, which has been published since 2000. Both surveys collect a massive amount of demographic and economic data.
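To make the difference from published tables concrete, here is a minimal Python sketch of the kind of calculation microdata allow. The records, field names, and the `mean_income` helper are all made up for illustration; they are not the bureau's or the Minnesota Population Center's actual format, and real analyses would also apply survey weights.

```python
# Hypothetical microdata: one record per person, no names or addresses.
records = [
    {"age": 35, "occupation": "plumber", "state": "NY", "income": 62000},
    {"age": 41, "occupation": "teacher", "state": "NY", "income": 58000},
    {"age": 29, "occupation": "plumber", "state": "NJ", "income": 51000},
    {"age": 52, "occupation": "plumber", "state": "NY", "income": 70000},
]


def mean_income(rows, **criteria):
    """Average income over the rows that match every key=value criterion."""
    matches = [r for r in rows if all(r[k] == v for k, v in criteria.items())]
    if not matches:
        return None  # no one in this made-up sample fits the definition
    return sum(r["income"] for r in matches) / len(matches)


# The analyst, not the agency, decides which population to construct.
print(mean_income(records, occupation="plumber", state="NY"))
```

The point of the sketch is that the population definition is an argument the analyst supplies, so any combination of characteristics can be studied, not only the ones the agency chose to tabulate.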

I’ve often wondered, when using this wealth of data, whether I could uncover actual people. Take this hypothetical: Imagine that in the 2017 American Community Survey, I found a 35-year-old man who is a plumber and lives in New Rochelle, New York. From the microdata, I also know he has three kids and his wife is an elementary school teacher. Could I identify him in real life? If I could, I would be able to glean private information like his income and health insurance status from a public database released by the government.

But even if I did find the plumber in New Rochelle, I couldn’t be sure that the income and health insurance information listed for him is correct, because the agency takes measures to anonymize its data. For example, the bureau introduces randomness by swapping the ages or races of otherwise demographically similar households.
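A rough Python sketch of the swapping idea, under loud assumptions: the `swap_ages` function, its `match_keys` and `rate` parameters, and the sample households are all hypothetical, and the bureau's real procedure is not described in enough detail here to reproduce. The sketch only shows the general shape: group households that look alike, then let some pairs within a group trade one attribute.

```python
import random


def swap_ages(households, match_keys, rate, rng):
    """Return a copy in which some pairs of similar households trade ages.

    Households count as "similar" when they agree on every field in
    match_keys. Each pair inside a group is swapped with probability rate.
    """
    result = [dict(h) for h in households]  # leave the input untouched
    groups = {}
    for i, h in enumerate(result):
        groups.setdefault(tuple(h[k] for k in match_keys), []).append(i)
    for idxs in groups.values():
        rng.shuffle(idxs)  # pair households up at random within the group
        for a, b in zip(idxs[::2], idxs[1::2]):
            if rng.random() < rate:
                result[a]["age"], result[b]["age"] = (
                    result[b]["age"],
                    result[a]["age"],
                )
    return result


# Made-up households that match on city and household size.
sample = [
    {"id": 1, "city": "New Rochelle", "size": 5, "age": 35},
    {"id": 2, "city": "New Rochelle", "size": 5, "age": 47},
    {"id": 3, "city": "Yonkers", "size": 2, "age": 61},
]
print(swap_ages(sample, ("city", "size"), rate=0.5, rng=random.Random(0)))
```

Because ages only move between households in the same group, totals for each group are unchanged, but a snoop can no longer be sure which record belongs to which household.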

The Census Bureau now thinks this might not be enough. Its new approach is called “differential privacy,” a method already used by tech companies like Apple and Facebook in their attempts to protect the personal data of users. Differential privacy works by introducing so much randomness into the data that it is mathematically impossible to accurately reconstruct confidential information. One upshot of differential privacy is that microdata cannot be offered to the public.
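For a sense of what “introducing randomness” can look like, here is a minimal Python sketch of one textbook technique, the Laplace mechanism, applied to a single count. This is an illustration only: nothing here confirms that it is the method the bureau intends to use, and the function names and the `epsilon` values are assumptions chosen for the example.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1 / epsilon.

    Adding or removing one person changes a count by at most 1, so a
    scale of 1 / epsilon is the standard calibration for a count query.
    Smaller epsilon means more noise and stronger privacy.
    """
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(2018)
true_plumbers = 120  # made-up number of plumbers in a made-up town
for epsilon in (1.0, 0.1):
    print(epsilon, round(noisy_count(true_plumbers, epsilon, rng), 1))
```

The trade-off is visible in the parameter: the noise averages out over many releases or large populations, but for any single small group the published number can sit far from the truth, which is exactly what worries researchers who work with fine-grained data.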

Steven Ruggles, director of the University of Minnesota’s Institute for Social Research and Data Innovation, finds this decision baffling. In a report he prepared on the differential privacy decision, he points out that there is “not a single documented case” of a data analyst revealing the responses of a person in the real world. He says the data swapping and randomization process that is currently used is already quite strong. He calls the idea that the census needs tighter privacy rules “chimerical.” Yes, he acknowledges, it is possible to see people’s personal data, but if you can’t identify who they actually are, who cares?

Ruggles isn’t the only researcher who is upset. Many academics and journalists fear that differential privacy will impede their research. Although the bureau suggests that much of the data will still be accessible in secure federally run data centers across the country, access to this information will become more expensive and time-consuming.

The decision may not be final. There were similar rumblings about privacy issues prior to the 2000 census that were eventually resolved. Ruggles is hopeful that a compromise can be reached, making the information even more secure while maintaining wide access to this incredibly rich data source.