Oxford University’s lip-reading AI is more accurate than humans, but still has a way to go

Read my lips.
Image: DeepMind/Oxford

Even professional lip-readers can figure out only 20% to 60% of what a person is saying. Slight movements of a person’s lips at the speed of natural speech are immensely difficult to reliably understand, especially from a distance or if the lips are obscured. And lip-reading isn’t just a plot point in NCIS: it’s an essential tool the hearing-impaired use to understand the world around them, and, if automated reliably, it could help millions.

A new paper (pdf) from the University of Oxford (with funding from Alphabet’s DeepMind) details an artificial intelligence system, called LipNet, that watches video of a person speaking and matches text to the movement of their mouth with 93.4% accuracy.

The previous state-of-the-art system operated word by word and had an accuracy of 79.6%. The Oxford researchers say the success of their new system comes from a different way of thinking about the problem: instead of teaching the AI each mouth movement using a system of visual phonemes, they built it to process whole sentences at a time. That allowed the AI to teach itself which letter corresponds to each slight mouth movement.
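
For readers who want to see what “processing whole sentences at a time” looks like in practice, the sketch below shows the general shape of such a model: a network that turns a video clip into per-frame character scores and is trained with a connectionist temporal classification (CTC) loss, which aligns whole sentences to frames without needing per-frame labels. This is a minimal illustration in PyTorch, not the authors’ code; the layer sizes, the 28-character vocabulary, and the input dimensions are assumptions for the example.

```python
# Minimal sketch of sentence-level lip reading with a CTC loss (illustrative only;
# not LipNet's actual architecture -- layer sizes and vocabulary are assumptions).
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, num_chars=28, hidden=128):
        super().__init__()
        # Spatiotemporal features from the mouth-region video: (B, C, T, H, W).
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space, keep every frame
        )
        # A recurrent layer reads the whole clip before characters are scored.
        self.gru = nn.GRU(input_size=32 * 16 * 16, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_chars)  # per-frame character scores

    def forward(self, video):                # video: (B, 3, T, 32, 32)
        feats = self.conv(video)             # (B, 32, T, 16, 16)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.gru(feats)             # (B, T, 2*hidden)
        return self.fc(out)                  # (B, T, num_chars)

model = LipReader()
video = torch.randn(2, 3, 75, 32, 32)        # two dummy clips, 75 frames each
logits = model(video).log_softmax(-1)        # (B, T, V)

# CTC aligns the whole character sequence to the frames, so no hand-labelled
# per-frame visemes are needed -- only the sentence transcript.
targets = torch.randint(1, 28, (2, 20))      # dummy character indices
ctc = nn.CTCLoss(blank=0)
loss = ctc(logits.permute(1, 0, 2),          # CTC expects (T, B, V)
           targets,
           input_lengths=torch.full((2,), 75, dtype=torch.long),
           target_lengths=torch.full((2,), 20, dtype=torch.long))
loss.backward()
```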

To train the system, researchers showed the AI nearly 29,000 videos labelled with the correct text, each three seconds long. To see how human lip-readers would handle the same task, the team recruited three members of the Oxford Students’ Disability Community and tested them on 300 random videos similar to those they fed their AI. Those humans had an average error rate of 47.7%, while the AI’s was just 6.6%.
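
Error rates like these are typically word error rates: the number of word insertions, deletions, and substitutions needed to turn the predicted transcript into the correct one, divided by the length of the correct transcript. Here is a minimal sketch of that calculation, assuming the standard edit-distance definition (the paper’s exact scoring protocol may differ):

```python
# Word error rate via edit distance (standard definition; illustrative only).
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

# One wrong word out of six is roughly a 16.7% word error rate.
print(word_error_rate("bin blue at f two now", "bin blue at s two now"))
```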

Despite the success of the project, it also reveals some of the limits of modern AI research. When teaching the AI how to read lips, the Oxford team used a carefully curated set of videos. Every person was facing forward and well-lit, and spoke in a standardized sentence structure.

For example, “Place blue in m 1 soon” was one of the standard three-second phrases used in training, each consisting of a command, a color, a preposition, a letter, a number from 1-10, and an adverb. Every sentence followed that pattern. So the AI’s extraordinary accuracy might have to do with the fact that it was trained and tested under extraordinary conditions. If the AI were asked to read lips in random YouTube videos, for instance, its results would probably be far less accurate.
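
To get a feel for how constrained that vocabulary is, the snippet below generates phrases in the same fixed pattern. The word lists are illustrative stand-ins, not the exact vocabulary used in the training videos.

```python
# Rough sketch of the fixed command-color-preposition-letter-number-adverb
# pattern described above. Word lists are illustrative guesses only.
import random

PATTERN = {
    "command":     ["place", "set", "lay", "bin"],
    "color":       ["blue", "green", "red", "white"],
    "preposition": ["in", "at", "by", "with"],
    "letter":      list("abcdefghijklmnopqrstuvwxyz"),  # e.g. "m"
    "number":      [str(n) for n in range(1, 11)],      # "number from 1-10"
    "adverb":      ["soon", "now", "please", "again"],
}

def random_phrase() -> str:
    """Build one phrase by picking a word from each slot, in order."""
    return " ".join(random.choice(words) for words in PATTERN.values())

print(random_phrase())   # e.g. "place blue in m 1 soon"
```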

Some of the most interesting public discourse about AI papers happens afterwards, on the vast expanse of Twitter. When other researchers pointed out that such specialized training videos aren’t representative of real-world conditions, author Nando de Freitas defended his paper’s results, noting that the other video sets the team tried were too noisy: each was too different from the last for the AI to draw meaningful conclusions, meaning a suitable data set just doesn’t exist yet. De Freitas wrote that he was confident the AI had shown it would be up to the task, given the right data.

According to OpenAI’s Jack Clark, getting this to work in the real world will take three major improvements: a large amount of video of people speaking in real-world situations, the ability to read lips from multiple angles, and a wider variety of phrases the AI can predict.

“The technology has such obvious utility, though, that it seems inevitable to be built,” Clark writes. Teaching AI to read lips is a base skill that can be applied to countless situations. A similar system could be used to help the hearing-impaired understand conversations around them, or to augment other AI systems that listen to a video’s audio and rapidly generate accurate captions.

Correction: A previous version of this article inaccurately inferred that LipNet is an Alphabet DeepMind project. In fact, while the project is funded in part by DeepMind, all IP resulting from LipNet belongs solely to Oxford University.