If AI is going to be the world’s doctor, it needs better textbooks

Illustration of how bad data can introduce bias into an AI system.
Image: Zack Rosebrugh for Quartz

Imagine there was a simple test to see whether you were developing Alzheimer’s disease. You would look at a picture and describe it, software would assess the way you spoke, and based on your answer, it would tell you whether or not you had early-stage Alzheimer’s. It would be quick, easy, and over 90% accurate—except for you, it wouldn’t work.

That might be because you’re from Africa. Or because you’re from India, or China, or Michigan. Imagine most of the world is getting healthier because of some new technology, but you’re getting left behind.

Actually, you don’t have to imagine. This scenario is real. Winterlight Labs, a Toronto-based startup, is building auditory tests for neurological diseases like Alzheimer’s disease, Parkinson’s, and multiple sclerosis. But after publishing their initial research (pdf) in the Journal of Alzheimer’s Disease in 2016, the team hit a snag: The technology only worked for English speakers of a particular Canadian dialect.

“When you actually talk to real doctors and patients, suddenly the things that weren’t apparent to computer scientists working in a basement with data become more evident,” says Winterlight co-founder Frank Rudzicz. For Winterlight, the one major, unforeseen obstacle was language. The data the company had collected, by asking people in Ontario to actually interact with their software, were from native English speakers talking in their mother tongue. A native French speaker taking the test in English might, by contrast, pause while thinking of an English word, or pronounce another word with some uncertainty in their voice. Those tics could be misconstrued for markers of a disease.
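A toy sketch makes the confound concrete. The threshold and numbers below are invented for illustration, not Winterlight’s actual model: a classifier that treats pause rate as a marker of cognitive decline cannot distinguish a word-finding pause caused by disease from one caused by speaking a second language.

```python
# Hypothetical sketch: pause rate as a proxy for cognitive decline.
# The cutoff below is an assumed value, tuned (implicitly) on native speakers.
PAUSE_THRESHOLD = 0.15  # pauses per word

def flag_for_alzheimers(pauses, words):
    # Flags anyone whose pause rate exceeds the threshold, regardless of cause.
    return pauses / words > PAUSE_THRESHOLD

# Native speaker with early Alzheimer's: frequent word-finding pauses.
print(flag_for_alzheimers(pauses=30, words=150))  # True

# Healthy non-native speaker, also pausing to find English words.
print(flag_for_alzheimers(pauses=28, words=150))  # True — a false positive
```

The feature itself isn’t wrong; it simply encodes two populations the training data never separated.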

“If [people] were pausing or if their vocabulary wasn’t stellar, those were indicative of Alzheimer’s disease in a particular dataset—but it was [also possibly] indicative of the fact that you were learning English now in your 30s or 20s,” Rudzicz said.

AI-powered systems like Winterlight Labs’ are quickly becoming the next frontier in health care, making their way out of the computer lab and into helping real people make real medical decisions. The formula is almost the same across all medical disciplines: Gather a bunch of data on previous patients, and use it to predict what will happen when a new patient steps in the door. Hospitals are being built on the premise that these systems are the future, and startups using AI for health care raised $5 billion in venture capital in 2016 alone, according to a TM Capital analysis.

The data used to train these systems are crucial to ensure accuracy, since research has shown time and again that AI algorithms are slaves to the data from which they learn. When AI was implemented in the US criminal justice system to predict recidivism, for example, it was found to disproportionately suggest black people were more likely to commit future crimes, regardless of how minor their initial offense. If the data are flawed, missing pieces, or don’t accurately represent a population of patients, then any algorithm relying on the data is at a higher risk of making a mistake.

Deep learning, the kind of algorithm behind the current AI boom, is especially susceptible to bias. Deep-learning software works by finding patterns in the data it’s trained on—meaning if men are associated with doctors and women are associated with nurses in a text it analyzes, then the algorithm will apply that bias to answer questions like whether a given person is a doctor or a nurse. Researchers worry that similar biases lurk in health-care data. Consider what could happen when doctors begin relying on AI to diagnose diseases like skin cancer, or determine which drug treatment is best for a serious illness based on biological markers. In products like Google Photos, these biases reaffirm false stereotypes and have detrimental impacts on users, but in the field of medicine, they can be the difference between life and death.
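That failure mode is easy to reproduce in miniature. The sketch below (all data invented) builds a frequency-based “model” from a skewed toy corpus and shows it faithfully reproducing the skew when asked to predict:

```python
# Toy sketch: a model trained on biased co-occurrence counts
# reproduces that bias at prediction time.
from collections import Counter

# Assumed training corpus: occupations mentioned alongside gendered words.
corpus = [
    ("man", "doctor"), ("man", "doctor"), ("man", "nurse"),
    ("woman", "nurse"), ("woman", "nurse"), ("woman", "doctor"),
]

counts = Counter(corpus)

def predict_occupation(gender):
    # Pick whichever occupation co-occurred most often with this gender.
    return max(("doctor", "nurse"), key=lambda occ: counts[(gender, occ)])

print(predict_occupation("man"))    # "doctor" — learned from skewed counts
print(predict_occupation("woman"))  # "nurse"
```

Real deep-learning models are vastly more complex, but the principle is the same: the statistics of the training data become the statistics of the predictions.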

Unfortunately, the medical datasets openly available for use by AI researchers are notoriously biased, especially in the US. It’s not a secret: Health-care data is extremely male and extremely white, and that has real-world impacts. A 2014 study that tracked cancer mortality over 20 years pointed to a lack of diverse research subjects as a key reason why black Americans are significantly more likely to die from cancer than white Americans.

Many expect artificial intelligence to be at the heart of the next iteration of health care, including the US Food and Drug Administration, which has just started to approve commercial AI products for doctors and hospitals to use to augment diagnostic decisions. The first steps look promising: IDx, an AI startup that won the FDA’s first commercial approval for an AI health device (its product detects diabetic retinopathy), says that it worked with the government agency to keep bias in mind throughout the approval procedure.

“We paid close attention to potential subject demographics,” IDx co-founder Michael Abramoff told Quartz in an email. “It was important to both IDx and the FDA that our pivotal study population reflected the diversity of people with diabetes in this country for whom the device was going to be indicated.” He says that even though IDx’s AI software looks for the same biomarkers in retinal images regardless of race, it’s important to make sure any clinical trials or retrospective studies including the software are diverse.

“Bias at any point in data handling for precision medicine can lead to the recapitulation of longstanding health disparities,” wrote cultural anthropologist Kadija Ferryman and social psychologist Mikaela Pitcan in a February report for Data & Society. That means a world where black Americans die more often than white Americans will, at best, remain unchanged; worse, that demographic gap could widen.

Any discussion about bias in AI will be confusing, difficult, and uncomfortable, because bias is hidden and tricky, until it’s obvious and dangerous. That is to say, biased outcomes from a biased algorithm are easier to spot than the biased data fed into the machine.

In health care, these biased outcomes tend to mean one group of people gets better medical treatment than another. These groups are often characterized by gender or race, but sometimes by other traits, like language, skin type, genealogy, or lifestyle. Sometimes these biases can be sussed out early on, but usually only if there’s a diverse group of people thinking about the problem from the beginning of a project. Earlier this year, Charles Isbell, executive associate dean at the Georgia Institute of Technology, told a US Congressional subcommittee that when he was studying for his computer-science PhD in the 1990s, facial-recognition software couldn’t see him because he was black.

“I was breaking all of [my classmates’] facial-recognition software because apparently all the pictures they were taking were of people with significantly less melanin than I have,” Isbell said. Since those facial-recognition algorithms didn’t have any previous examples of darker-skinned faces, the product wasn’t able to categorize an entire group of people. “And so,” Isbell said, “they had to come up with ways around the problem—of me.”

The study of human genomics—which looks at the structure, function, evolution, and mapping of human genomes—is plagued by homogenous data. A 2016 meta-analysis of 2,511 studies from around the world found that 81% of participants in genome-mapping studies were of European descent. This has severe real-world impacts: Researchers who download publicly available data to study disease are far more likely to use the genomic data of people of European descent than those of African, Asian, Hispanic, or Middle Eastern descent.

Shockingly, that 81% is actually an improvement. Alice Popejoy, now a postdoc at Stanford University, was a co-author of the 2016 study. She started the analysis after repeatedly hearing lectures citing a 2009 study that had found that 96% of participants in genome-mapping studies were of European descent. “It’s not just an ethical or moral problem, it’s really a scientific problem,” says Popejoy.

That’s because efforts to mine these flawed datasets for use in clinical settings are proliferating. Deep Genomics, for example, is developing new treatments for Mendelian disorders like Huntington’s disease and cystic fibrosis. Sophia Genetics is integrating with hospitals to analyze patient genomes and give on-site diagnoses. IBM Watson has touted genomics as a kind of silver bullet against cancer, allowing physicians (in theory) to personalize treatment like never before.

But the missing demographic chunks of information in genetic datasets could scuttle these potential AI-based tools by rendering them about as good as guesswork. “The big takeaway is that we don’t know what we don’t know,” says Popejoy. Without knowing about variations between populations, she says, we can’t really say what the implications of those variations are for treatment.

Some researchers are trying to gather more diverse data. The Fred and Pamela Buffett Cancer Center in Omaha, Nebraska, for example, retains its patients’ genomic data to train its AI, doctors from the hospital told Quartz. On a more global scale, consider the International Cancer Genome Consortium (ICGC), an effort to gather genomic markers in patients with cancer from all over the world. Even this presumably universal project is severely limited. It has data from nearly 20 different countries on dozens of cancers, but there are severe imbalances: Many of the countries are only represented in the datasets of a single disease. For example: There are 32 projects in the US (population: 326 million), but only one in India (1.3 billion) and none from the entire African continent (also about 1.3 billion).

Lincoln Stein, chair of the ICGC, says each discovery the organization makes will come with an asterisk. “We can identify a biomarker of prostate-progression risk, but we have to qualify it to say it’s only been tested in white Caucasians,” Stein says. “If you want to apply it to an African-Canadian or an indigenous person, we don’t have the data to say whether it works.”

Some inherent bias in genomic data stems from non-scientific roots. For one, the data simply over-represent people who get sick more. This is called “recruitment bias.” The demographics of the neighborhood where a hospital is located, how it advertises clinical trials, and who enrolls for those trials, all further exacerbate recruitment bias.

There’s also a matter of cost. In 2012, Canada’s Personal Genome Project asked for volunteers willing to provide a DNA sample for genomic mapping; the results would be published on the internet. Only 56 Canadians (out of a population of about 35 million at the time) signed up. Of the study’s 56 initial Ontarian participants, 51 identified as white. That might have something to do with the fact that study participants had to pay about $4,000 for the privilege; in Ontario, minorities make $0.73 for every dollar a Caucasian person makes.

This isn’t a problem we can table for future ethicists to handle. AI researchers are already beginning to present their technology as viable for providing both second opinions and alternative options in medical practices—especially in dermatology.

In early 2017, a Stanford University study claiming an AI system was more accurate than trained dermatologists at diagnosing malignant skin lesions from images scored the cover page of Nature. That paper has been cited over 400 times, and dozens more research teams have proposed similar diagnosis systems.

The problem is datasets of malignant and benign melanoma overwhelmingly feature lighter skin. David Leffell, chief of dermatologic surgery and cutaneous oncology at the Yale School of Medicine, says that’s because white people disproportionately suffer from the disease. A May 2018 study pitting AI against dermatologists claimed similar limitations stemming from a limited amount of demographically diverse skin tones (the authors did not respond to questions about where more diverse data might be found).

This is a valid argument for many seemingly imbalanced medical datasets: They aren’t biased; they simply reflect the population that suffers from a given ailment. The counterargument: Even if a disease genuinely skews toward one demographic, a machine still needs enough examples from every group to learn to diagnose that group.
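A back-of-the-envelope sketch (synthetic numbers, not from any cited study) shows why that matters: with a 95/5 demographic split, a model can perform no better than a coin flip on the minority group and still report an impressive headline accuracy.

```python
# Sketch with invented numbers: overall accuracy hides per-group failure.
n_majority, n_minority = 950, 50

# Suppose the model is 98% accurate on the majority group
# and only 50% (chance level) on the minority group.
correct = 0.98 * n_majority + 0.50 * n_minority
overall_accuracy = correct / (n_majority + n_minority)

print(f"{overall_accuracy:.1%}")  # 95.6% — the headline number hides the gap
```

A model optimized for overall accuracy has little incentive to improve on the group it rarely sees, which is why per-group evaluation matters.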

That might be why other research is now calling into question whether these skin cancer-detecting AIs will actually end up useful in clinical settings, or if they’ll be confounded by the same hurdles as Winterlight Labs’ Alzheimer’s test. A study out of the MIT Media Lab published in February 2018 found that facial-recognition systems from companies like IBM and Microsoft were 11%-19% more accurate on lighter-skinned individuals. They were particularly bad at identifying women of color: The AIs were 34% less accurate at recognizing darker-skinned females compared to lighter-skinned males.

Very few dark-skinned people were in the original dataset; even fewer were female. “Without a dataset that has labels for various skin characteristics such as color, [hair] thickness, and the amount of hair, one cannot measure the accuracy of such automated skin cancer-detection systems for individuals with different skin types,” write MIT researcher Joy Buolamwini and Microsoft Research’s Timnit Gebru in the 2018 study. In other words, a person’s body-hair type can skew an AI’s assessment of whether or not he or she has skin cancer.

Shortly after the 2017 Nature paper came out, co-author Brett Kuprel told Quartz that the only datasets they could find were made up mostly of lighter skin samples. As a result, the study was biased towards Caucasian skin. Sebastian Thrun, a Stanford University professor, co-author of the paper, and a legendary name in AI research (for being among the first to prove self-driving cars were possible) told Quartz via email that the team did not test for variation in skin tone. “I agree much more work is needed before we can confidently recommend such a technique for field use,” Thrun wrote.

The Stanford paper drew a large portion of its data from the International Skin Imaging Collaboration. That tool, which features about 13,000 publicly available skin-lesion images taken with medical dermoscopes, is meant to standardize how images of human skin appear in clinical settings. But the ISIC database doesn’t yet feature skin-type labels. Stephen Dusza, an epidemiologist who worked on a Memorial Sloan Kettering Cancer Center team that helped compile the ISIC database, says that without the skin-type labels, there’s no way to research the correlation of skin type and skin color, unless trained dermatologists manually label thousands of images. Dusza says ISIC will release the labels once they’re sure it’s “scientifically relevant” to do so.

Stanford’s skin-cancer AI system and others like it are only just beginning to leave the lab, so we don’t know what kind of impact they will have in clinical settings, or what frameworks government regulators and health-care accrediting bodies will put into place to regulate them.

There is some precedent for the government to step in to ensure diversity during the data-gathering phase of new health policies and medical treatments. In 1993, the US Congress compelled the National Institutes of Health to bring more diversity to the medical studies it funded. It’s not clear Congress or the NIH can solve this problem alone; more than 20 years later, 81% of genome-study participants are still of European descent. Furthermore, a 2015 study found that only 2% of the more than 10,000 NIH-funded cancer studies included enough minority participants to yield statistically significant results for those groups. The study points to multiple potential causes, including inadvertent incentives in the NIH’s funding structure, but the simplest is a lack of diversity in the medical field itself, and the propensity for non-white researchers to be funded less often.

Today, the mainly white, mainly male community of computer scientists trying to build the future of medicine is only just beginning to wake up to the idea that they need to feed their machines with data diverse enough to represent the entire patient population affected by the disease they are trying to fight. If, that is, they want to win the battle.

Some companies focused on AI are already aware of these issues, and are taking steps to remedy them. For example, despite getting hammered by Buolamwini and Gebru’s paper on bias in facial recognition, IBM actively researches ways to mitigate bias in machine learning, and has since said it would update its facial-recognition tool to be more inclusive.

Meanwhile, at Winterlight Labs, Frank Rudzicz is still working to collect data on other speech patterns and languages. He’s very much aware that the data he has—and his colleagues in the field have—aren’t nearly good enough to solve the health-care problems they want to solve. “The kinds of things we’re doing in machine learning works well if you’re trying to show off to computer scientists,” he says, “but, in practice, it doesn’t.”

Correction: A previous version of this story incorrectly referred to the Fred and Miranda Buffett Cancer Center. The correct name of the hospital is the Fred and Pamela Buffett Cancer Center. 

This story is one in a series of articles on the impact of artificial intelligence on health care and medicine.