If you want to talk about something important or sensitive, odds are that conversation will happen over the phone rather than by email or text. The sound of someone’s voice is an important part of trusting them, but our ability to trust that the voice on the other end of the call is human might soon change.
Google DeepMind announced a new speech generation method it calls WaveNet, which could bring artificial intelligence closer than ever to mimicking human speech so well that listeners cannot tell the difference. The algorithm can easily learn different voices and even generates artificial breaths, according to a DeepMind blog post. DeepMind, the London-based AI firm acquired by Google in 2014, broadly works to “solve intelligence,” a goal that spans health, data center energy efficiency, and ancient Chinese board games.
Building audio from scratch is difficult because the data is so dense. Each second of digitally recorded speech (at the quality of a typical phone call) is made of 16,000 individual data points, called samples. It takes a lot of computing power to recreate information that granular. Google’s current method of doing this uses AI, but the audio is generated in a chunkier way.
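To put that density in concrete terms, here is a small back-of-the-envelope sketch. It assumes only the 16,000-samples-per-second figure quoted above; the function name `samples_needed` is made up for illustration and does not come from DeepMind.

```python
# Samples per second, taken from the figure quoted in the article.
SAMPLE_RATE_HZ = 16_000


def samples_needed(duration_seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return how many individual samples a clip of this length contains."""
    return round(duration_seconds * sample_rate_hz)


if __name__ == "__main__":
    print(samples_needed(1))    # one second of speech
    print(samples_needed(60))   # one minute of speech
```

At this rate, a single minute of speech works out to 960,000 samples, each of which a generator has to produce.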
Imagine you want to make a cup. You can build it with Legos, or use clay. In the Legos scenario, the building blocks rest on one another to make a complete structure, but each was created separately. That’s like Google’s method now, which uses recurrent neural networks.
You can listen to that here:
WaveNet works more like the clay. The clay is rolled into a long string and coiled, every part existing in the context of what comes before it. This approach uses convolutional neural networks, where previously generated data is considered when producing the next bit of information. By taking advantage of this continuous generation, DeepMind was able to cut the gap in quality between human and machine-generated speech by 50% in blind tests.
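The idea of each new piece depending on what came before can be sketched in a few lines. This is a toy illustration, not DeepMind's code: the real system learns its filters from data and stacks many layers of them, while this sketch uses one fixed, hand-picked filter. The names `next_sample` and `generate` and the weights are hypothetical.

```python
def next_sample(history, weights):
    """Predict one sample from the most recent len(weights) samples.

    The filter only looks backward in time, never at future samples.
    """
    recent = history[-len(weights):]
    # Pad with silence (zeros) if there is not enough history yet.
    padded = [0.0] * (len(weights) - len(recent)) + recent
    return sum(w * x for w, x in zip(weights, padded))


def generate(seed, weights, n_samples):
    """Extend `seed` one sample at a time, feeding each output back in."""
    audio = list(seed)
    for _ in range(n_samples):
        audio.append(next_sample(audio, weights))
    return audio


if __name__ == "__main__":
    # Each new sample is the average of the two samples before it.
    print(generate([0.0, 1.0], [0.5, 0.5], 3))
    # [0.0, 1.0, 0.5, 0.75, 0.625]
```

Every value the loop produces becomes part of the input for the next one, which is the "coiled clay" property described above.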
Listen to WaveNet say the same thing here:
For WaveNet to understand what human speech sounds like, it first had to listen. DeepMind researchers fed the algorithm 44 hours of speech from 109 different English speakers. Results showed that after learning from that wide range of speakers, the algorithm could model any single speaker it had learned from. WaveNet could even include that speaker’s idiosyncrasies, like breaths and audible mouth movements.
Researchers also found that if they fed the algorithm classical music instead of speech, the algorithm would compose its own songs.
Maybe it should stick to speech, and leave composing to the humans for now.