Transcribing a conversation between two humans is one of those tasks that’s deceptively difficult for machines to tackle. Even if the audio file is high quality and doesn’t have any background noise, the algorithm needs to contend with different voices, interruptions, hesitations, corrections, and a litany of common conversational nuances.
A new paper from Microsoft Research claims to slightly surpass human-level transcription of conversational speech, even when the human transcript is double-checked by a second transcriber for accuracy. The team attributes this achievement not to any breakthrough in algorithms or data, but to the careful tuning of existing AI architectures.
To test how their algorithm stacked up against humans, the researchers first needed a baseline. Microsoft hired a third-party transcription service to work through audio for which a verified, fully accurate transcript already existed. The service worked in two stages: one person typed up the audio, then a second person listened to it and corrected any errors in the transcript. Measured against the correct transcripts for the standardized tests, the professionals had error rates of 5.9% and 11.3%.
After training on 2,000 hours of human speech, Microsoft's system tackled the same audio and scored error rates of 5.9% and 11.1%. That minute difference amounts to about a dozen fewer errors.
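The error rates above follow the standard word-error-rate (WER) convention: the minimum number of word substitutions, insertions, and deletions needed to turn the system's output into the reference transcript, divided by the number of words in the reference. As a minimal sketch (not Microsoft's evaluation code, which uses the field's standard scoring tools):

```python
# Minimal word error rate (WER) sketch: word-level edit distance between
# reference and hypothesis, divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a six-word reference: 1/6, roughly 16.7% WER
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

A 5.9% WER thus means roughly one error for every 17 words of conversation, which is why a 0.2-point gap between human and machine translates to only about a dozen errors on a test of this size.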
Microsoft’s next challenge is making this level of speech recognition work in noisier environments, like in a car or at a party. Getting there is crucial for Microsoft, whose ambitions go well beyond transcription.
This work is another step toward Microsoft’s goal of making conversation with a computer seem smooth and effortless. If the computer can’t understand what a person is saying, completing that command or answering that question becomes much harder. This is foundational for everything else Microsoft wants to achieve. Earlier this year, Microsoft CEO Satya Nadella claimed that artificial intelligence is the future of the company, and that conversation would be its cornerstone.
Despite its success, the automatic system differs from human transcribers in one big way: it can’t interpret small conversational cues like “uh.” The sound “uh” can hold a speaker’s place in a conversation while they think, or signal that the other person should keep talking, as with “uh-huh.” Professional human transcribers can note whether a sound is a hesitation or an affirmation, but these little cues are lost on the machine, which cannot infer why each sound was made.