The spread of computing to every corner of our physical world doesn’t just mean a proliferation of screens large and small—it also means we’ll soon come to rely on mobile computers with no screens at all. “It’s now so inexpensive to have a powerful computing device in my car or lapel, that if you think about form factors, they won’t all have keyboards or screens,” says Scott Huffman, head of the Conversation Search group at Google.
Google is already moving rapidly to enable voice commands in all of its products. On mobile phones, Google Now for Android and Google’s search app on the iPhone allow users to search the web via voice, or carry out other basic functions like sending emails. Similarly, Google Glass would be almost unusable without voice interaction. At its conference for developers, Google unveiled voice control for its Chrome web browser. And Motorola’s new Moto X phone has a specialized microchip that allows the phone to listen at all times, even when it’s asleep, for the magic word that begins every voice conversation with a Google product: “OK…”
There’s nothing new about voice interaction with computers per se. What’s different about Google’s work on the technology is that the company wants to make it as fluid and easy as keyboards and touch screens are now. That’s a challenge big enough that, thus far, it has kept voice-based interfaces from going mainstream in our personal computing devices. And in cases when they are in use, such as interactive voice response systems designed to handle customer service calls, they can be frustrating.
“What we’re really trying to do is enable a new kind of interaction with Google where it’s more like how you interact with a normal person,” says Huffman. To illustrate, he picks up his smartphone and says “How far is it from here to Hearst Castle?”
Normally, getting an answer to such a seemingly simple question would require googling “Hearst Castle,” clicking on a map, and typing in your own address. But Huffman’s phone gets the answer right on the first try—a neat illustration of how voice commands can save time and effort. In a way, it’s part of the natural progression of convenience in computer interfaces: 10 years ago writing an email required walking over to a computer, five years ago we could whip out our phones, and in the near future we’ll simply start talking.
To achieve this kind of apparent simplicity, the Conversation Search group has to muster everything that Google already knows about the real world. That’s because the meaning behind language is always dependent on context, as anyone knows who has discovered that half the battle of learning a foreign language is absorbing the culture in which it’s embedded.
“One thing that really helps us is the base of all the core relevance and ranking work that the Google search engine is famous for,” says Huffman. Part of that “relevance” is the Google Knowledge Graph, a database of people, places and things that allows Google to know, for example, that when you ask it for “Cruise movies” you are probably asking for the films of Tom Cruise, rather than “crews movies” or any of a number of other possibilities.
This context doesn’t just make Google’s voice interfaces usable—some day, it could make them even better than humans. “Today, automatic speech recognition is not as good as people, but our ambition is, we should be able to be better than people,” says Huffman. In order to achieve that, Google will leverage the intimate knowledge it has of its users.
“In some sense Google has a lot of context that [a human transcriptionist] doesn’t have,” says Huffman. “We know where you are based on your phone’s location and there is some context around what you’ve been talking about lately. Therefore that should help us understand what kinds of things you might be saying.”
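The idea of using context to choose between similar-sounding transcriptions can be illustrated with a small sketch. It is not Google’s method: the candidate list, the scores, the tiny `KNOWN_ENTITIES` table and the `context_bonus` weight are all invented here for illustration.

```python
# Illustrative only: every name, score, and weight below is an assumption.
KNOWN_ENTITIES = {"cruise": "Tom Cruise (actor)"}  # stand-in for a knowledge base


def rescore(candidates, recent_topics, context_bonus=0.2):
    """Pick the transcription with the best combined score.

    candidates: list of (text, acoustic_score) pairs, higher is better.
    recent_topics: set of lowercase words the user has mentioned lately.
    """
    def combined(item):
        text, acoustic_score = item
        bonus = 0.0
        for word in text.lower().split():
            if word in KNOWN_ENTITIES:   # the word names a known person or thing
                bonus += context_bonus
            if word in recent_topics:    # the word matches recent conversation
                bonus += context_bonus
        return acoustic_score + bonus

    return max(candidates, key=combined)[0]


# Two hypotheses that sound alike; the raw audio slightly favors the wrong one.
candidates = [("crews movies", 0.52), ("cruise movies", 0.50)]
best = rescore(candidates, recent_topics={"movies"})
no_context = rescore(candidates, recent_topics=set(), context_bonus=0.0)
```

With the context bonus, the hypothesis that mentions a known entity wins even though its audio score is slightly lower. Without it, the raw score decides.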
The future of Google’s voice interfaces isn’t just accurate interpretation of commands, but real interaction—hence the “conversation” part of Huffman’s Conversation Search group. One trick Google’s voice interface can already do is understand pronouns like he, she and it. “You can ask yourself why in language do things like pronouns exist—well, they exist because it lets us communicate faster than we do without them,” says Huffman.
To demonstrate, Huffman follows up his question about how far it is to Hearst Castle with the sentence “give me directions,” which doesn’t even include the pronoun “it,” but his phone begins rattling off directions in its tinny computerized voice, anyway.
All of this is, of course, a demonstration laid out in advance for my benefit. And like any other nascent technology it doesn’t always work perfectly. At other points in Huffman’s demo, his smartphone fails to understand the pronouns he’s using. One reason for that, he notes, is that Google’s voice interface “forgets” the subject of any conversation with it after a certain amount of time. Just as in natural conversation, it has a limited attention span.
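That kind of limited attention span can be sketched as a remembered subject that expires. This is a guess at the general shape, not a description of Google’s system: the `ConversationContext` class, its 60-second window and the injectable clock are hypothetical.

```python
import time


class ConversationContext:
    """Remembers the last subject for a limited time (hypothetical design)."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds  # assumed attention span
        self.clock = clock              # injectable so the example is testable
        self.subject = None
        self.updated_at = None

    def remember(self, subject):
        """Store the subject of the latest query and note when it was set."""
        self.subject = subject
        self.updated_at = self.clock()

    def current_subject(self):
        """Return the remembered subject, or None if it has expired."""
        if self.subject is None:
            return None
        if self.clock() - self.updated_at > self.ttl_seconds:
            self.subject = None  # too much time has passed: forget it
            return None
        return self.subject


# Simulated clock so the example does not need to wait in real time.
now = [0.0]
ctx = ConversationContext(ttl_seconds=60.0, clock=lambda: now[0])
ctx.remember("Hearst Castle")

now[0] = 30.0
within_window = ctx.current_subject()  # a follow-up like "give me directions" still resolves

now[0] = 120.0
after_window = ctx.current_subject()   # the subject has been forgotten
```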
In conversation, a human being who has forgotten the referent for a pronoun like “it” might ask his or her companion what he or she is talking about. Google’s conversation search can’t do that yet, but Huffman says his team is working on it. Already, Google’s regular search results perform a version of this “can you clarify?” task by suggesting search terms and providing other disambiguating links at the top of search results. Eventually, Google’s voice search will do the same: “Did you mean the movies of Tom Cruise…” or, given your search history, “were you referring to the movies of Penelope Cruz?”
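One simple way to decide when to ask such a question is to compare the two most likely interpretations and ask only when they are close. The sketch below assumes that approach. The `respond` function, its scores and the `margin` threshold are hypothetical, and it expects at least one interpretation.

```python
def respond(interpretations, margin=0.1):
    """Answer directly, or ask for clarification when the top two are close.

    interpretations: non-empty list of (meaning, score) pairs, higher is better.
    margin: assumed threshold below which the top two count as ambiguous.
    """
    ranked = sorted(interpretations, key=lambda pair: pair[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return f"Did you mean {ranked[0][0]} or {ranked[1][0]}?"
    return f"Showing results for {ranked[0][0]}."


ambiguous = respond([
    ("the movies of Tom Cruise", 0.55),
    ("the movies of Penelope Cruz", 0.50),
])
clear = respond([
    ("the movies of Tom Cruise", 0.90),
    ("the movies of Penelope Cruz", 0.30),
])
```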
At this point, voice commands are a little-used feature of most people’s everyday interactions with computers, if people are using them at all. Between the present and a future in which we are reliably interacting with computers by voice alone, there are a number of challenges, some of them fundamental to what we think of as a computer interface.
One challenge to voice control is simply reliability and error correction. For example, as Google Glass transcribes your words for an email, text or social media update, you can actually see the ghostly words hovering in your field of view, but how does an interface that relies solely on our ears accomplish the same? Does it read our messages back to us?
Another issue is that current visual computer interfaces limit our options in ways that can make them easier to use. For example, in graphical user interfaces we can find out what a program can do by clicking on all of its buttons and looking under its menus. But commanding a computer by voice is more like the old model of interaction with a computer—the command line. It’s a potentially powerful interface—Huffman imagines a future in which we might even communicate with our computers via a verbal short-hand—but it would require that humans learn a whole new way to control computers, and learn anew the capabilities of all the software that might be used in this way.
Ultimately, none of these issues may prove as insurmountable as the ones that Google has already overcome by virtue of its enormous search database, knowledge of the real world, cloud computing infrastructure and army of Ph.D.s who work on voice recognition and natural language processing. Currently, the everyday magic of understanding voice commands is carried out almost entirely in the cloud, because processing human speech is difficult enough that even a sophisticated smartphone doesn’t have the processing power to do it at a high enough level of reliability.
That means voice commands issued to Google’s hardware and software are recorded, shot into the cloud and parsed into next steps, rather than being handled by the device itself. “For speech recognition, it’s a very data intensive thing,” says Huffman. “We use giant neural network things that are spread across many servers.” Which means that when we talk to our phones, there really is someone listening to our every command—just not an intelligence we’d recognize as human.