At its annual developer pilgrimage in California yesterday, Google announced Duplex, a new feature for its virtual assistant. Duplex aims to make the boring phone call a thing of the past. It automatically calls businesses and can talk to the humans on the line to make appointments, book dinner reservations, that kind of thing.
For that to work, Google is making its computers sound a lot more awkward and imprecise. That is to say, more human.
Duplex is different from other “smart” assistants, in that the people primarily interacting with it are not aware that it is a computer. When a user asks Siri or Alexa for something, they are not surprised to get a stilted, robotic response or be totally misunderstood. But a restaurant host who gets a call from Duplex asking to make a reservation is not told that the voice belongs to a recurrent neural network built on TensorFlow Extended, or whatever. For that reason, Duplex’s speech had to sound more natural. If it were obviously a robot, the restaurant would probably just hang up.
Sounding natural is pretty hard for computers, though. Computers need precision. Human language, on the other hand, is full of imprecisions: mistakes, slip-ups, on-the-fly corrections, and sudden pauses. Think of the last time you heard somebody say something like this at a meeting:
“Umm, yeah, so… what I’m thinking is that we go ahead with this but… maybe wait until Tuesday or, I don’t know, maybe even, like… Thursday or Friday? Just so we can, you know, make sure we’ve dotted all the t’s and crossed all the i’s, er, oops, you know what I mean.”
Filler words like “um” and “you know” are present in every language. In fact, they serve a useful function, and can help put listeners at ease. The person at the meeting becomes a bit creepy if they just state, plainly and economically, “We will hand this over Friday to ensure that everything is in order.” That’s exactly the tone that Google hopes to avoid by introducing the “umms,” self-corrections, and other oddities that characterize human speech.
Here is one audio sample of Duplex calling to make a hair appointment, taken from the official blog post announcing the technology.
The robo-voice says “Umm, I’m looking for something around May 3rd.” Then later, “Do you have anything between 10am and, uh, 12pm?” It even fills empty spaces with “Mm-hmm.”
Duplex also inserts the random pauses that pepper ordinary speech. In another sample, it says, “The… number is… um…” then goes on to handle several interruptions in the process of giving out its phone number.
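To make that idea concrete, here is a rough sketch of what sprinkling fillers and pauses into a scripted reply might look like before the text is handed to a speech synthesizer. Google hasn’t published Duplex’s internals, so the function name, filler list, and probabilities below are invented purely for illustration; think of it as a toy, not a description of the real system.

```python
import random

# Hypothetical filler insertion: a toy illustration of how a scripted reply
# could be made to sound less robotic, not Google's implementation.
FILLERS = ["um", "uh", "you know"]

def add_disfluencies(reply, filler_prob=0.2):
    """Randomly sprinkle filler words and a trailing pause into a reply."""
    words = reply.split()
    out = []
    for word in words:
        # Occasionally drop in a filler before the next real word.
        if random.random() < filler_prob:
            out.append(random.choice(FILLERS) + ",")
        out.append(word)
    # A pause marker a TTS engine could render as a brief silence.
    if random.random() < filler_prob:
        out.append("...")
    return " ".join(out)

# Output varies per run, e.g. "um, I'm looking for something around May 3rd."
print(add_disfluencies("I'm looking for something around May 3rd."))
```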
Google didn’t offer specifics on how the technology was developed. It says it trained a neural network on “a corpus of anonymized phone conversation data.” The model takes into account factors like the appropriate intonation for a given situation, and the speed at which people normally respond to certain prompts, like how you might respond instantly to “hello” but pause before answering, “What time works for you?”
Even with all that, the technology is not complete. Impressive as they are, the audio samples still have moments that sound oddly inhuman. And Google has said that Duplex only works in a “narrow” set of situations; it can’t just chat about any random topic. For now, it also appears to work only in English.
Beyond the technical aspects, though, Duplex signals an important change in attitude about how algorithms should interact with humans. Technologists have long seen human tendencies like filler words as inefficiencies, waiting to be erased by the precision of machines. That is a good approach when computers are talking to other computers, as is often the case.
But now that computers are regularly talking to humans, they will need to be a bit, er, squishier.