How easy is it to find one patient among 1.3 million anonymized medical records?

If your colleague was in the hospital but didn’t want to tell you why, could you still figure it out? Maybe.

Publicly available data often contains enough personal information to allow casual acquaintances to locate specific people in medical records, even though the data is considered to be “de-identified.” Patient-level information, including hospital name, patient age, race, ethnicity, length of stay, and detailed diagnoses, can all be used to glean information most people think is private.

Let’s say you’re trying to find “Jordan.” You know a few basic things about Jordan. You know Jordan’s gender, race, ethnicity, and, roughly, Jordan’s age. You also know which New York county Jordan lives in.

New York released a database of hospital stays that includes all of this information about patients.

With these six attributes, there are 13,921 combinations. Three in 10 combinations result in a single unique individual. Still, those combinations are rare; just 4,205 individuals among the 1.3 million, or 0.33%, can be uniquely identified if you know those six things about them.
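The counting behind figures like these can be sketched in a few lines of Python. This is a minimal illustration, not the analysis used for this story: the function name, the field names, and the four toy records are all invented, and the list of attributes is an assumption standing in for whatever columns the real database uses.

```python
from collections import Counter

# Hypothetical attribute names; the real database's columns may differ.
QUASI_IDENTIFIERS = ("gender", "race", "ethnicity", "age_group", "county")

def uniqueness_summary(records, keys=QUASI_IDENTIFIERS):
    """Count attribute combinations and how many match exactly one record."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for n in counts.values() if n == 1)
    return {
        "combinations": len(counts),
        "unique_combinations": unique,
        # A combination seen once corresponds to exactly one record.
        "share_of_records_unique": unique / len(records) if records else 0.0,
    }

# Invented toy records, for illustration only.
records = [
    {"gender": "M", "race": "White", "ethnicity": "Not Hispanic",
     "age_group": "30-49", "county": "Kings"},
    {"gender": "M", "race": "White", "ethnicity": "Not Hispanic",
     "age_group": "30-49", "county": "Kings"},
    {"gender": "F", "race": "Black", "ethnicity": "Not Hispanic",
     "age_group": "18-29", "county": "Erie"},
    {"gender": "M", "race": "Black", "ethnicity": "Multi-ethnic",
     "age_group": "18-29", "county": "Albany"},
]

summary = uniqueness_summary(records)
print(summary)
```

In this toy data, two of the three combinations point to a single record. The same tally run on a real release would show how many people stand alone in their demographic group.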

The structure of a database can make some groups of individuals more likely to be uniquely identified. In our example, the chance of identifying Jordan depends heavily on the demographic groups to which he belongs. If Jordan is a Black, multi-ethnic man, aged 18 to 29, the probability of identifying him can be as high as 27%, a big jump from the 0.51% probability of identifying the average individual. That’s because there are only 15 Black, multi-ethnic men aged 18 to 29 in New York’s database.

Even in cases where multiple records match the profile of Jordan, a comparison of the records themselves could lead you to Jordan: Is it more likely that Jordan was treated for “congestive heart failure” or “mood disorders”? Or, if you knew that Jordan was only away for one week, you could exclude the matching profile with a 50-day hospital stay.
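That narrowing-down step can also be sketched in code. Again, this is only an illustration under stated assumptions: the helper names, the field names, and the two records are made up, and the seven-day cutoff simply encodes "Jordan was only away for one week."

```python
def matching_records(records, known):
    """Return the records whose fields equal every known attribute."""
    return [r for r in records
            if all(r.get(key) == value for key, value in known.items())]

def exclude_longer_stays(candidates, max_days):
    """Drop candidates whose hospital stay exceeds what you observed."""
    return [r for r in candidates if r["length_of_stay"] <= max_days]

# Invented records that share the same demographic profile.
records = [
    {"gender": "M", "age_group": "18-29", "county": "Albany",
     "length_of_stay": 50, "diagnosis": "congestive heart failure"},
    {"gender": "M", "age_group": "18-29", "county": "Albany",
     "length_of_stay": 6, "diagnosis": "mood disorders"},
]

# What you already know about Jordan (hypothetical values).
known = {"gender": "M", "age_group": "18-29", "county": "Albany"}

candidates = matching_records(records, known)
remaining = exclude_longer_stays(candidates, max_days=7)
print(len(candidates), len(remaining))
```

Two records match the profile, but only one is consistent with a one-week absence, and that record carries a diagnosis.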

In the US, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) protects patient privacy by allowing personally identifiable information to be used or disclosed only under a limited number of circumstances. For example, it would be illegal to make this database public if it included patients’ names, home addresses, phone numbers, or Social Security numbers. HIPAA places no restrictions, however, on the use or disclosure of de-identified health information. Yet, given just a small bit of external information, one can easily link de-identified health records back to individuals using simple techniques.

De-anonymizing data often involves combining multiple databases to extract unrelated information about the same person and piece together a full picture. An entire industry of data brokers has sprung up to take on this work and sell your information to others.

Insurance companies want to know your medical history. Car sellers want to get a hold of your driving habits. Real estate brokers would pay to find out if you just had a newborn and are looking to buy a house.

The example here uses just six factors about an individual to try to identify them. Databases collected by companies can contain hundreds of factors about a person. With the amount of data consumers consciously and unconsciously give away to big tech platforms, de-anonymization has become easier to do in recent years. And not just with simple methods like this.
