A flawed algorithm led the UK to deport thousands of students

Image: Reuters/Fabrizio Bensch

In 2014, Panorama, an investigative news program produced by the BBC, brought a story about student-visa fraud to Theresa May. At the time, May was home secretary, the high-level official in charge of immigration and citizenship.

“What Panorama has uncovered is extremely important, it’s very shocking, and I want to do something about it,” she told the show’s reporter.

May had already pledged to make the UK “a hostile environment” for illegal immigrants. But her subsequent actions may have led to the accidental deportation of thousands of legal university students. And the blame rests with the flawed computer system of a US company.

The TOEIC language test

The Panorama story centered on the TOEIC, a widely trusted test of English proficiency. The TOEIC is taken by millions of people every year, and is used by companies and universities around the world as evidence of a person’s English skills. In the past, the test was used to determine whether a prospective foreign student understood English well enough to qualify for a student visa in the UK.

The BBC sent undercover reporters, posing as would-be international students, to try to fake test results. They succeeded in paying fluent English speakers to take the speaking and listening portions of the test for them. And they were able to attend a testing session where the person administering the TOEIC immediately read all of the correct answers aloud to the group, giving each of them a perfect score.

Those revelations led May to ask for closer scrutiny of test results from Educational Testing Service, a company based in Princeton, New Jersey. ETS created and administers the TOEIC, as well as a number of other major standardized tests, including the GRE and the TOEFL.

The analysis delivered to the UK government said there were close to 34,000 “invalid” TOEIC test results. Over 22,000 further results were deemed “questionable.” Acting on those cases, official reports show, the government refused, cut short, or canceled the visas of nearly 36,000 people. Of those, 1,400 people were detained for some time, and another 4,600 were “removed” from the country.

There’s just one problem: The automated system that ETS used to identify fake test results appears to have been flawed, meaning some of those deportations might not have been justified.

Flawed voice-recognition technology

Several students who had their visas revoked appealed.

ETS had tried to identify fraud using voice-recognition software, according to the appeals court decision. It analyzed all of the tests from the UK and tried to identify cases where the same person was speaking on the verbal portion of multiple tests. A single voice taking several tests under different names would likely belong to a fraudulent test taker.
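ETS has never published how its system worked, but the general approach the court decision describes — checking whether the same voice appears on multiple tests — can be sketched. In the hypothetical below, voice "embeddings" (numeric fingerprints a speaker-verification model would produce) are compared pairwise, and pairs that are too similar are flagged; every name and threshold is an illustrative assumption, not ETS's actual method.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def flag_duplicate_voices(embeddings, threshold=0.85):
    """Flag pairs of tests whose voice embeddings are suspiciously similar.

    `embeddings` maps a test ID to a voice-embedding vector. In practice
    these would come from a speaker-verification model; here they are
    synthetic. The threshold is an arbitrary illustrative choice.
    """
    ids = list(embeddings)
    flagged = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if cosine(embeddings[ids[i]], embeddings[ids[j]]) >= threshold:
                flagged.append((ids[i], ids[j]))
    return flagged

# Toy data: tests A and B share one underlying voice; test C is a different voice.
rng = np.random.default_rng(0)
voice1 = rng.normal(size=64)
voice2 = rng.normal(size=64)
tests = {
    "A": voice1 + rng.normal(scale=0.05, size=64),
    "B": voice1 + rng.normal(scale=0.05, size=64),
    "C": voice2,
}
print(flag_duplicate_voices(tests))  # flags the pair sharing a voice: [('A', 'B')]
```

The hard part in reality is exactly what this toy glosses over: how similar is "too similar" when every recording nominally belongs to a different person.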

But voice-recognition technology is not perfectly reliable. The best voice-identification systems do have very low error rates, incorrectly identifying someone only around 0.5% to 1% of the time. However, those accuracy rates typically require the system to know who a speaker is ahead of time. When voice recognition is used to replace passwords, for example, a speaker will record themselves a few times so the software can build a complete profile of their voice. ETS would not have had that kind of data, since each speaker was nominally a different test taker.

The company did not respond to multiple requests from Quartz for clarification about how its system works. But a related case involving TOEIC fraud offers more details: The company was aware of the limitations of automatic voice identification, and in its process, each vocal sample flagged as "suspect" by the algorithm was later checked by two humans at the company. Human analysts agreed with the computer only about 60% of the time. Experts later called upon for that case said that these ETS employees were likely not adequately trained, and estimated the overall rate of false positives at around 1%.

Even at 1%, though, several hundred test results marked "invalid" by the technology could in fact have come from honest test takers. Other errors could have occurred, too. Some of those who appealed said that the voice recording they were given of their test session was not the correct file. One immigration lawyer put the error rate at between 5% and 10%, according to the Guardian. That would put the number of unjust deportations in the thousands.
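The arithmetic behind those estimates is straightforward. Applying the quoted rates to the roughly 34,000 results marked "invalid":

```python
invalid_results = 34_000  # results ETS marked "invalid", per the government's figures

# 1% is the experts' estimate; 5-10% is the immigration lawyer's range.
for fp_rate in (0.01, 0.05, 0.10):
    wrongly_flagged = round(invalid_results * fp_rate)
    print(f"At a {fp_rate:.0%} false-positive rate: ~{wrongly_flagged:,} honest test takers flagged")
```

That is roughly 340 people at 1%, and 1,700 to 3,400 at the lawyer's rates — hence "several hundred" at the low end and "thousands" at the high end.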

The government argued that “the proportion of the impugned decisions that was wrong or unfair is very small indeed,” according to the appeals court decision. But, the court adds, “even if that turns out to be the case, the individuals affected by those decisions will have suffered a serious injustice.”

The Financial Times (paywall) spoke to several people who have had their visas cancelled, and concluded that they all “spoke excellent English.”

Policymakers relying on technology they barely understand

The UK isn’t the only country to require language testing for immigration. English-language skills are required for long-term visas to Australia, New Zealand, and Canada. Donald Trump wants to make that the case for the US, too.

But as the May fiasco shows, policymakers worldwide are becoming reliant on computerized decision-making that they do not understand, often with unintended consequences. In the US, for example, software that attempts to predict whether convicted criminals will commit future crimes has been shown to be biased against black people.

In any case, no decision made by an algorithm will be correct 100% of the time. ETS used humans to check the output of their system. But with the computer being right only 60% of the time, analysts were left with tens of thousands of recordings to sift. Such a system could only ever hope to decrease the error rate, not erase it.
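A back-of-the-envelope calculation shows why human review can only shrink the error, never erase it. The 60% agreement figure comes from the court case; the "catch rate" of each human check below is purely an illustrative assumption, since the real figure was never published.

```python
flagged = 56_000          # ~34,000 "invalid" plus ~22,000 "questionable" results
machine_precision = 0.60  # analysts agreed with the computer ~60% of the time
false_flags = flagged * (1 - machine_precision)

# Hypothetical: suppose each of the two human checks catches 75% of the
# false flags that reach it (an assumption, not a figure from the case).
human_catch_rate = 0.75
remaining = false_flags
for _ in range(2):
    remaining *= (1 - human_catch_rate)

print(f"False flags leaving the machine: ~{false_flags:,.0f}")
print(f"False flags surviving two human checks: ~{remaining:,.0f}")
```

Even under these generous assumptions, roughly 1,400 false flags would survive both checks — a smaller number, but still a large group of real people.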

It is up to humans to determine how many false positives they are willing to accept. Is 20% acceptable? What about 1%? Along with their visas, former students lost homes, jobs, and futures. When false positives ruin lives, the acceptable rate might be best placed at 0%.