“Voice skins” will make the internet a freer—but more dangerous—place

You could be anyone and so could anyone else.
Image: Reuters/David W Cerny

For more on voice skins, check out the seventh episode of our Should This Exist? podcast, which debates how emerging technologies will impact humanity.

Who would you be if you could be anyone? If your physical self didn’t limit the possibilities and your social circumstances didn’t impose any rules of behavior, would you still choose to be you? Or might you try to be some other kind of human?

These are questions that the internet has allowed people to explore like never before. We choose avatars that allow us to boldly go where our real-life personas would not, and that has proved positive for some people who find a voice online.

It can also be dangerous, as one Jewish mother discovered when her 13-year-old son became the moderator for an alt-right subreddit filled with hate speech: Users had no clue he was a Jewish kid from a liberal family experimenting with rebellion. They took him seriously, which only fueled the boy’s passion for online debate and made him more fully engaged in the ideology (until he met these friends in real life).

For many, anonymity frees us from accountability. Untethered to our physical selves, free from our names and the possibility of injury to reputation, we can say and do things we might not otherwise because the societal consequences are eliminated. As Oscar Wilde put it in 1891, “Man is least himself when he talks in his own person. Give him a mask, and he will tell you the truth.”

That’s got both upsides and downsides. Anonymity allows for creativity and play, exploration, and perhaps, as Wilde said, an honesty that’s impossible in other contexts. But it also allows for dangerous deception, abuse, fraud, and crime.

As technology advances, it will allow even more avenues for exploration, and for deceit. There will be room for creativity, and also plenty of dangers.

Second skin

Gamers have been playing with identity questions for some time. In real life, they are whatever they are, but when playing Fortnite, say, they adopt characters that need not resemble their real identities at all. Visual “skins” allow players to dress up their characters in the physical attributes and styles they desire. A man can be a woman. A kid can be an adult. The character is like a blank doll, and skins allow players to dress it in the hair, face, body, and clothes they choose.

The option to try being someone else is increasingly available beyond gaming, too. The social media platform Snapchat, for example, recently added a gender-changing filter, which works better for men aiming to look like women than the reverse. The feature is incredibly popular, and seems to be driving Snapchat downloads, which may be a reflection of just how much fun people think it is to experiment with an alternate self.

Now, voice technology is catching up to more advanced visual tools. In February, Modulate, a Cambridge, Massachusetts-based company, raised $2 million in seed financing to create audio skins that gamers can use to customize the voices of their avatars.

Modulate uses a type of machine-learning technique called a generative adversarial network (GAN), along with conventional audio-processing techniques, to “teach” the software how to speak in a realistic voice. The machine “learns” how a user is speaking and transforms their real voice into the target voice they chose. The more voices that are fed into the software, the better it gets at learning, and the more authentic the modulated voices will be (for now, the samples on the Modulate website still sound synthetic).
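Modulate hasn’t published its pipeline beyond that description, but the “conventional audio processing” half of the recipe can be illustrated with a toy pitch shifter. The sketch below (pure Python; all function names are hypothetical, for illustration only) raises or lowers the pitch of a tone by naive resampling. Real voice skins do far more, preserving the speaker’s timing, inflection, and timbre:

```python
import math

SAMPLE_RATE = 16000  # samples per second

def sine(freq, seconds=1.0, rate=SAMPLE_RATE):
    """Generate a pure tone as a stand-in for a voice recording."""
    n = int(seconds * rate)
    return [math.sin(2 * math.pi * freq * i / rate) for i in range(n)]

def shift_pitch(samples, factor):
    """Naive pitch shift: resample with linear interpolation.

    factor > 1 raises the pitch, factor < 1 lowers it. Note the
    duration changes too, a side effect real voice changers
    compensate for with more sophisticated time-stretching.
    """
    out = []
    pos = 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
        pos += factor
    return out

def dominant_freq(samples, rate=SAMPLE_RATE):
    """Estimate frequency by counting positive-going zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * rate / len(samples)

voice = sine(220.0)               # a low tone, roughly a masculine pitch
higher = shift_pitch(voice, 1.5)  # dominant_freq rises from ~220 to ~330 Hz
```

This kind of resampling is why cheap voice changers sound chipmunk-like: pitch and speed move together. The GAN approach the company describes instead learns a mapping between voices, so pacing and expressiveness can survive the transformation.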

These audio skins will make it possible for players to more completely inhabit their characters. If a female gamer chooses a male avatar to inhabit in the game, she can modulate her voice to sound more masculine, all while maintaining the inflections and tone of her actual speech. Players can also adopt accents for their characters.

In some ways, this is cool. It’s the obvious next step in making gaming feel as real as possible. The whole point of play is to engage in an alternate reality, and the more realistic this fantasy world seems, the more fun it might be.

But of course gamers don’t leave the real world behind when they play online. Women, for example, tend to be harassed more than male players, so this tool might help more women feel safe in these online communities. Female gamers already mask their identities to avoid harassment, often by forgoing verbal communication with other players entirely, according to a 2018 study of 270 women players in the Journal of Mental Health and Addiction.

Voice skins could change that. But they’d also help reinforce a problematic social environment in which being female is deemed inferior. A woman who chooses a male character and voice simply because she wishes to be free from harassment may have an easier time, but that still leaves the underlying societal problem intact. The message for female gamers then becomes that they have to pretend to be guys if they want to fit in.

That plays into intolerance rather than addressing its roots, which means women will continually face the same problems. Dealing with the root issues, on the other hand, offers some chance that future generations of players won’t face the same pressures or embrace the same stereotypes. It’s fair to argue that a female gamer might not want to bear the burden of smashing the patriarchy every time she logs on to play, but the problems won’t be solved simply by donning male skins.

Additionally, the anonymity that skins create disconnects players from accountability. A gamer might talk a friendly and tolerant game as themselves in an online forum, say, yet still play as a character who spews hate or bullies others, because in the context of play they are free to be anyone, even a jerk.

That’s not to say that skins shouldn’t be allowed, or that the self is best when imprisoned in fear about consequences. We must simply acknowledge that each new tool that frees us from the rules of physics raises questions, too. And often, the people who are making these tools are also most aware of the dangers and most vocal about raising the issues.

Deep fakes

The other practical and pressing issue that arises from the development of voice skin technology is that its use won’t necessarily remain confined to gaming and entertainment, even if the founders of Modulate can ensure that their particular tools do.

“Modulate is about creativity and freedom, not impersonating others,” co-founder Carter Huffman has said in a statement about the software. “We’ve built ethical safeguards into our company from the ground up, from how we distribute our technology, to how we select the voice skins to offer, to watermarking our audio for detection in sensitive systems.”

But voice fraud is on the rise, which means the better technologists get at masking actual voices, the better fraudsters will get at deception. By pretending to be someone else on the phone, a voice fraudster can call a bank, brokerage, or insurer and access private information. Feigning another’s identity by voice is easier than ever, thanks to new audio tools and increased reliance on call centers for service (as opposed to going to the bank and talking to a teller, say). In 2018, Pindrop, a company that creates security software and protocols for call centers, reported a 350% rise in voice fraud between 2013 and 2017, primarily targeting credit unions, banks, insurers, brokerages, and card issuers.

Beyond the possibility for financial crimes, there’s the danger of increasingly effective catfishing, the practice of creating a fake online identity in order to target victims for bullying, harassment, or crime. As technology advances, the extent of this kind of deception will, too.

Information apocalypse

On a broader level, the new tools for deceit can have incredible influence on society. Take, for example, Russian meddling in the 2016 US presidential election. By pretending to be Americans online, Russian agents were able to disseminate fake information on social media and alter citizens’ perception of key issues before they went to vote. They did this without the benefit of voice technology, but as the tools advance, so do the chances of bad actors coming up with ever-more-sophisticated schemes to undermine or advance regimes of their choosing.

Deepfakes are already a problem. These are doctored videos made by superimposing imagery from existing videos to create a new one that gives the impression that an influential person is saying something that they never said, or doing something they never did. Improved tools make it easier than ever to disseminate misinformation that is harmful because it seems like the information comes from a trusted authority.

Pair deepfakes with voice skins—visual and audio AI that rely on the same machine learning techniques—and there’s the potential for deception that will be very difficult to distinguish from authentic content.

The problem of spotting fakery has been debated since the advent of Photoshop, software primarily used to edit images. Image editing tools erode the truth many people expect to find in images because a doctored image no longer reflects an actual captured moment. Video and audio make the deception more profound, taking the ability to fake reality even further.

In a “post-truth” world already awash in fake news, experts fear an “information apocalypse.” Lars Buttler, CEO of the AI Foundation, a software company which creates “reality defense” mechanisms, like a web browser plug-in that can alert users to potential fakery, told The Verge last August, “We felt we were at the threshold of something that could be very powerful but also very dangerous. You can use these tools in a positive way, for entertainment and fun. But a free society depends on people having some sort of agreement on what objective reality is, so I do think we should be scared about this.”

Buttler and his colleagues are worried that if they don’t develop tools that enable us to distinguish real content from fake, we will develop “reality apathy,” becoming indifferent to the distinction between what’s real and what is not. As Buttler wrote in a post on Medium last August, “If the authenticity of media can no longer be trusted, we can no longer agree on shared objective reality, a critical necessity for a free society.”
