Sometimes we hear something and can’t believe it. That happened to me today.
New research from Barcelona’s Pompeu Fabra University has trained an AI to sing better than I can after listening to just 35 minutes of audio. Speech generation has improved drastically over the last few years, with research labs like Baidu and DeepMind constantly one-upping each other to produce the most realistic robot voice.
But there’s something about this singing audio that actually tricked me. Listen for yourself.
It sounds like just some guy singing a little out of tune. He’s no Mel Tormé—wait, it’s not even a “he”—but it sings better than me.
Merlijn Blaauw, co-author on the paper, tells Quartz that the team used an older approach on top of DeepMind’s new WaveNet voice generator. Instead of learning just from the raw audio itself, they analyzed the audio and broke it into components: pitch, timbre, and aperiodicity (the “breathy” component of the voice).
“And by separating pitch and timbre components, we can easily manipulate pitch to match any desired melody,” Blaauw wrote in an email.
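The idea Blaauw describes can be illustrated with a toy sketch (mine, not the paper’s actual model): if each analysis frame of a voice is reduced to a pitch value plus a separate timbre description, you can rescale the pitch contour to hit any melody note while leaving the timbre alone.

```python
import numpy as np

def transpose_pitch(f0_contour, semitones):
    """Shift a frame-by-frame pitch contour (f0, in Hz) by some semitones.

    The timbre component (e.g. a spectral envelope, not modeled here) would
    be left untouched, which is what makes this manipulation possible.
    Unvoiced frames, conventionally marked f0 == 0, stay at zero.
    """
    f0 = np.asarray(f0_contour, dtype=float)
    ratio = 2.0 ** (semitones / 12.0)  # equal-tempered pitch ratio
    return np.where(f0 > 0, f0 * ratio, 0.0)

# A toy contour: an A3 (220 Hz) note with a short unvoiced gap.
contour = [220.0, 220.0, 0.0, 220.0]

# Move it up an octave (+12 semitones) to match a new melody note;
# each voiced frame doubles to 440 Hz, the unvoiced frame stays silent.
print(transpose_pitch(contour, 12))
```

This is only the pitch half of the story; the real system resynthesizes audio from all three components, but the sketch shows why separating them makes the melody freely adjustable.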
The AI can also learn from smaller amounts of audio when the data is broken down this way, which makes me wonder how long it will be until an interview with a musician or a solo voice track is all it takes for someone to copy their voice. (Couple that with face-stealing technology and you’ve got an AI cover band on your hands.)
Blaauw was actually surprised at how well the neural network, a statistical approximation of how the brain learns, was able to use these components to understand how the voice should sound. The team gauged the network’s understanding by having it mimic a softer voice—and the network applied things it had learned from the regular voice, like notes and phonetic transitions, to the softer one.
Here’s the soft voice:
And a more “powerful” one:
You can listen to the rest of the recordings, including a cappella tracks, here.