Big data promises to harness information and deliver a more efficient economy, better pricing, and products that better suit our needs. But its proponents still don’t know what to do when data justifies behavior that challenges our ethics.
Data inference can be inherently racist. Racism projects qualities onto an entire group of people, and to a large extent that is exactly what statistics does. In an academic setting that didn’t seem troubling, because it helped us understand what drives disparities in health, wealth, income, and education, and what we can do to fix them. But now that we are using data for commercial purposes or for justice, the stakes are higher, because data use can justify racist behavior and affect people’s lives.
Recently, ProPublica dug into algorithms that predict how likely criminals in the US are to reoffend. The models assign black Americans a higher probability of reoffending. At first sight that isn’t surprising, because black Americans have a higher measured rate of recidivism. But what’s troubling is that the models are biased. According to ProPublica’s analysis, the algorithms produce more false positives for black defendants (flagging them as likely to reoffend more often than the data bear out) and more false negatives for white defendants (labeling them low risk more often than the data bear out). ProPublica declared the algorithms “racist.” The consequences are serious because the models are used at each stage of the criminal process by local judges and help determine everything from bond amounts to sentencing.
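The kind of group-wise comparison ProPublica ran can be sketched in a few lines. The data below are toy values invented for illustration, not ProPublica’s dataset; the point is only to show what a false-positive rate and a false-negative rate mean when computed per group:

```python
# Toy illustration of a group-wise error-rate audit (invented data).
# Each record: (predicted_high_risk, actually_reoffended, group)
records = [
    (True,  False, "A"), (True, True, "A"), (True,  False, "A"), (False, True,  "A"),
    (False, False, "B"), (True, True, "B"), (False, True,  "B"), (False, False, "B"),
]

def error_rates(records, group):
    rows = [r for r in records if r[2] == group]
    fp = sum(1 for pred, actual, _ in rows if pred and not actual)      # flagged, didn't reoffend
    fn = sum(1 for pred, actual, _ in rows if not pred and actual)      # cleared, did reoffend
    negatives = sum(1 for _, actual, _ in rows if not actual)
    positives = sum(1 for _, actual, _ in rows if actual)
    return fp / negatives, fn / positives   # (false-positive rate, false-negative rate)

for g in ("A", "B"):
    fpr, fnr = error_rates(records, g)
    print(g, round(fpr, 2), round(fnr, 2))
```

In this toy sample, group A’s false-positive rate is far higher than group B’s even though their false-negative rates match; a disparity of that shape is what ProPublica reported.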
It is not clear whether the algorithm is overtly racist or just poorly specified. Northpointe, the for-profit company that built it, has not disclosed which variables its algorithm uses. It is not clear whether race is among them, but the firm admits the model includes variables correlated with race, like education and employment. Based on ProPublica’s analysis, the model’s forecasts are not much better than a coin flip at predicting whether someone will reoffend.
Orwellian concerns aside, a model with such low predictive power shouldn’t be used to determine anyone’s prison sentence. But even if the model were accurate, it might still be racist, because it is estimated from data produced by a racist justice system and society that contribute to higher black recidivism rates. Even if race is excluded, variables correlated with race will predict higher rates of recidivism, and the model will in turn predict higher rates for black Americans.
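The proxy-variable mechanism can be made concrete with a small simulation. Everything here is an assumption for illustration: a synthetic population in which unemployment rates differ by group, and a “race-blind” model that scores risk from employment status alone. Race never enters the model, yet the average predicted risk still differs by group:

```python
# Sketch (synthetic data, illustrative numbers): a model that never sees race
# can still assign one group higher risk, via a race-correlated variable.
import random
random.seed(0)

population = []
for _ in range(10_000):
    group = random.choice(["black", "white"])
    # Assumed unequal unemployment rates, standing in for institutional disparities.
    unemployed = random.random() < (0.30 if group == "black" else 0.15)
    # Assumed: recorded recidivism here depends only on employment status.
    reoffended = random.random() < (0.40 if unemployed else 0.20)
    population.append((group, unemployed, reoffended))

def rate(rows):
    return sum(r[2] for r in rows) / len(rows)

# "Model": estimated P(reoffend | employment status). Race is never used.
p_unemployed = rate([r for r in population if r[1]])
p_employed   = rate([r for r in population if not r[1]])
score = lambda unemployed: p_unemployed if unemployed else p_employed

# Average predicted risk by group differs anyway.
for g in ("black", "white"):
    rows = [r for r in population if r[0] == g]
    print(g, round(sum(score(r[1]) for r in rows) / len(rows), 3))
```

Because the groups differ in the proxy variable, the “race-blind” scores reproduce the disparity baked into the training data.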
Now that data is being used more widely, there are more examples of it leading to racist outcomes. Recently, algorithms that use variables correlated with race resulted in black Americans being offered unfavorable credit terms. Regulations prevent lenders from denying credit explicitly on the basis of race or gender. But now that financial firms are able to use big data, if the data are racist (if, say, black Americans are associated with higher credit risk for a variety of unfair institutional reasons), they may end up offering different credit terms to people of different races.
Using demographic data is common in the insurance industry. Women pay more for annuities because they live longer; younger drivers tend to pay more for auto insurance because they tend to get into more accidents. It may seem unfair, but it means insurance companies can offer prices that are attractive to a wider range of customers. You can’t price insurance based on race (or on gender in Europe), but economists sometimes argue insurers should be able to use all the available data. Hispanic Americans have longer life expectancies than black Americans. That means a black American who buys an annuity ends up subsidizing Hispanic annuitants. He may be put off by the high price, skip the annuity, and end up carrying more risk.
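The link between life expectancy and annuity price is just present value. A back-of-the-envelope sketch, with hypothetical numbers (a 3% discount rate and assumed remaining life expectancies of 20 vs. 25 years), shows why the longer-lived buyer’s fair price is higher:

```python
# Back-of-the-envelope sketch (hypothetical numbers): the fair price of an
# annuity paying $1 per year is the present value of those payments over
# the buyer's expected remaining lifetime.
def annuity_price(payout_years, rate=0.03):
    """Present value of $1 per year for `payout_years` years at `rate`."""
    return sum(1 / (1 + rate) ** t for t in range(1, payout_years + 1))

print(round(annuity_price(20), 2))  # shorter-lived buyer
print(round(annuity_price(25), 2))  # longer-lived buyer: same $1/year costs more
```

Priced at a pooled average instead, the shorter-lived buyer overpays relative to his expected payout, which is the subsidy described above.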
Big data fans often argue that data and algorithms can improve the human experience and displace human judgment. But there is no good, scientific way to compensate for racist data. Big data has the potential to bring back the days when Americans were denied credit based on their race or where they lived. At worst, using racist data in criminal justice or finance can further entrench the institutional racism that already exists in the data.