Big data—the kind that statisticians and computer scientists scour for insights on human beings and our societies—is cooked up using a recipe that’s been used a thousand times. Here’s how it goes: Acquire a trove of people’s highly personal data—say, medical records or shopping history. Run that huge set through a “de-identification” process to anonymize the data. And voila—individuals become anonymous, chartable, and unencumbered by personal privacy concerns.
So what’s the problem? It turns out that all that de-identified data may not be so anonymous after all.
So argues Arvind Narayanan, a Princeton computer scientist who first made waves in the privacy community by co-authoring a 2006 paper showing that Netflix users and their entire rental histories could be identified by cross-referencing supposedly anonymous Netflix ratings with the Internet Movie Database. Narayanan and fellow Princeton professor Edward Felten delivered the latest blow to de-identification proponents—those who maintain the practice is viable—with a July 9 paper that makes a serious case for data paranoia.
They argue that de-identification doesn’t work—in theory or in practice—and that those who say it does are promoting a “false sense of security” by naively underestimating the attackers who might try to deduce personal information from big data. Here are Narayanan and Felten’s main points:
Personal location data isn’t really anonymous
A 2013 study showed that given a large dataset of human mobility data collected from smartphones, 95% of individuals were uniquely identifiable from as few as four points—think check-ins or shared photos with geo-location metadata. Even the most devout de-identificationists admit there’s no robust way to anonymize location data.
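To see why so few points suffice, consider a toy version of the attack. The sketch below uses purely synthetic mobility traces (invented here for illustration, not the study's data): each person is a bag of (location, hour) observations, and we check what fraction of people are uniquely pinned down by four points drawn from their own trace. Even in this crude simulation, four points almost always single out one person.

```python
import random

# Synthetic mobility traces: each person is a set of (cell_id, hour) points.
# 1,000 people, ~40 observations each, drawn from 50 cells x 24 hours.
# These numbers are illustrative assumptions, not the 2013 study's parameters.
random.seed(0)
people = {
    f"user{i}": {(random.randrange(50), random.randrange(24)) for _ in range(40)}
    for i in range(1000)
}

def fraction_unique(traces, k):
    """Fraction of people whose trace is the only one containing k
    randomly chosen points from that trace."""
    hits = 0
    for person, trace in traces.items():
        sample = set(random.sample(sorted(trace), k))
        # Who else's trace contains all k observed points?
        matches = [p for p, t in traces.items() if sample <= t]
        hits += (matches == [person])
    return hits / len(traces)

print(fraction_unique(people, 4))  # close to 1.0: four points are enough
```

The intuition is combinatorial: the chance that someone *else's* trace happens to contain all four of your observed points shrinks multiplicatively with each point, so a handful of check-ins acts like a fingerprint.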
Experts don’t know how vulnerable data is
In a case study of the meticulously de-identified Heritage Health Prize dataset, which contains the medical records of 113,000 patients, University of Ottawa professor and de-identification expert Khaled El Emam estimated that less than 1% of patients could be re-identified. Narayanan, on the other hand, estimated that over 12% of patients in the data were identifiable. And if an attack is informed by additional, specific information—say, an attempt to defame a known figure by exposing private details—it could be orders of magnitude easier to finger an individual within a dataset.
De-identification is hard, and re-identification is forever
De-identifying data is challenging and error-prone. In a recently released dataset of 173 million taxi rides in New York City, it turned out that individual taxis, and even their drivers, could be identified because the license plate numbers had been disguised with a shoddy hash (a mathematical function meant to scramble a value irreversibly). The flaw: plate numbers follow a fixed, short format, so the space of possible plates is tiny, and an attacker can simply hash every possible plate and match the results against the dataset.
The thing is, when a person’s anonymity is publicly compromised, it’s immortalized online. That can be an even worse problem than a data breach at a company or web app. When a company’s security is breached, cleanup is messy but doable: the flaw is patched, users are alerted, and life goes on. Abandoning a compromised account is feasible; abandoning an entire identity is not.
So should we smash our smartphones, swear off health care, and head for the hills? Not according to the de-identification defender El Emam. He points out that Narayanan did not actually manage to re-identify a single patient in the Heritage Health Prize dataset. “If he is one of the leading re-identification people around,” El Emam says, “then that is pretty strong evidence that de-identification, when done properly, is viable and works well.”
That’s good news for all us human beings who make up big data. But just because the anonymity of big data hasn’t been definitively broken yet doesn’t mean it’s unbreakable.