I sit down across from Laura Cisneros, Doppler Labs’ resident Spanish speaker, and put in the company’s early-prototype earbuds. Between us is a landscape of open circuitry—Bluetooth transmitters and circuit boards. Under the table sits a suitcase packed with two computers and a nest of wiring. Cisneros starts to speak to me in Spanish. About half a second after she begins talking, I hear a translation in English.
It’s difficult to describe hearing someone speak in another language, and immediately understanding what they’ve said. It feels like being kicked in the chest by the future. I can’t help but smile, seeing a crack in the language barrier that no text-based translator or even translated Skype chat could replicate.
The Amazon Echo has proven audio’s power to be the medium through which we interact with our devices, but people still underestimate the power of putting computers in your ears. Smarter hearing could automatically make a conversation in a loud bar more understandable, a commute quieter, or, eventually, a real-time dialogue across languages entirely possible. Earbuds capable of doing all this would make life noticeably and immediately better for any average person—at least that’s Doppler’s pitch.
Launched in 2013, Doppler Labs spent two years on research and development before launching a Kickstarter in 2015 for the Here Active Listening System. Aimed at audiophiles and musicians, the Active Listening System didn’t actually play music, but instead altered the sounds of the world around you. Doppler is now working to ship that system’s successor: Here One, wireless earbuds for regular consumers that intelligently augment hearing, play music, and facilitate interaction with a smartphone’s virtual assistant.
The company is capitalizing on a gap in the market—well-designed, high-quality wireless earbuds that don’t make you look like an early-aughts Bluetooth dad. But Doppler is also vying to redefine the way we hear, and by extension how we interact with the world around us, by giving our ears their own assistants.
The silicon ear
These kinds of grand ambitions wouldn’t have been feasible even five years ago. As we sit down to discuss the computers I’ve just wedged into my ears, Doppler executive chairman and AI guru Fritz Lanman recites a familiar Silicon Valley prayer:
“Thanks to Google for open-sourcing TensorFlow, thanks to Moore’s Law, thanks to the fact that we can collect this data at scale, thanks to the fact that the phones are powerful enough to run some of these models locally so we don’t have to rely on server clusters for everything,” he intones. “From an engineering standpoint, it’s just a really great time to be working on this, because it’s finally possible.”
Doppler’s proposition—that it can sell you ears better than the ones attached to your head—rests on having built a custom brain to control those ears. While the Here Active Listening System focused on tweaking musical performances and sound stages, the Here One earbuds (delayed until February) will be able to cut out or amplify the noise around you. More ambitious prototypes, like the translation system, reside in a moonshot-like lab, showcasing the tech’s long-term potential.
“If you can actually do this right, this really becomes a personal system,” says Doppler CEO Noah Kraft. “Because your ears are different than everyone else’s, and your preferences for what you want to hear is different than everyone else’s.”
To augment or muffle the sounds around us, Doppler’s software must be able to do a few things within milliseconds: Identify a specific noise, understand the characteristics of that noise, and then alter the noise without distorting it. These ambitions all rely on machine learning, a still-nascent field in which algorithms are taught to find patterns in data.
It goes without saying, but there are a lot of sounds in the world. To develop one algorithm to identify every noise wouldn’t make sense—the whole algorithm would have to run every time, which wastes space, memory, and battery life. It would be like sending the whole fire department, every ambulance, and a dozen police officers to discover that a cat got stuck in a tree.
Instead, the Doppler team is developing hundreds of highly specialized algorithms to identify every sound you might encounter during your day: clinking silverware, a car horn, an airplane passing overhead. Some of those algorithms, like the ones for sirens, are pretty straightforward (although the team has different algorithms for American and European sirens). Others are more difficult to pin down.
“Babies are ridiculously variable,” Lanman says. “[They’re] wide-band and unpredictable and unique.”
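One way to picture the many-small-detectors idea is a registry in which each detector declares the frequency band it cares about, so that only the relevant ones run on a given sound. The sketch below is purely illustrative and is not Doppler's implementation: the `Detector` class, the `run_relevant` function, the feature names, and every threshold are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "snapshot" is a hypothetical dict of summary features for one short sound.
Snapshot = Dict[str, float]


@dataclass
class Detector:
    name: str
    low_hz: float    # lower edge of the band this detector cares about
    high_hz: float   # upper edge of that band
    matches: Callable[[Snapshot], bool]  # the detector's own (cheap) test


def run_relevant(detectors: List[Detector], snap: Snapshot) -> List[str]:
    """Run only the detectors whose band contains the sound's peak frequency."""
    peak = snap["peak_hz"]
    hits = []
    for d in detectors:
        # Band check first, so out-of-band detectors never execute.
        if d.low_hz <= peak <= d.high_hz and d.matches(snap):
            hits.append(d.name)
    return hits


# Placeholder bands and thresholds, chosen only to make the example run.
DETECTORS = [
    Detector("us_siren", 500.0, 1800.0,
             lambda s: s["sweep_hz_per_s"] > 100.0),
    Detector("eu_siren", 350.0, 600.0,
             lambda s: s["sweep_hz_per_s"] <= 100.0 and s["two_tone"] > 0.5),
    Detector("silverware", 3000.0, 9000.0,
             lambda s: s["duration_s"] < 0.2),
]
```

In this toy version, a short, high-pitched clink only ever reaches the silverware detector. The siren detectors are skipped entirely, which is the point of not sending the whole fire department.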
The world’s soundtrack
Despite all the recent successes in artificial-intelligence research, AI still fails when it encounters something it’s never seen (or heard) before. And so like anybody in the deep learning game, Doppler needs data—lots of it.
The team realized early on that the audio they needed (hundreds of thousands of recordings of crowds in different cities, water, traffic, chatter, church bells, rustling leaves, boat horns) didn’t exist, or at least didn’t exist in the quality and abundance required. So they hired a field audio team to document the sounds of the modern world. To ensure the algorithms know which sounds they’re learning, Doppler is also labeling all of this information manually: They have about 100,000 samples labeled so far.
The six-person team is based out of four major cities—New York, San Francisco, Chicago, and Shanghai—and two members also roam from country to country. Together they’ve amassed more than 1 million recordings from 60-plus locations on five continents.
The field team captures noise in two ways: with high-end microphones worn in the ear like earbuds, to document sound the way the brain would hear it, and with their smartphones. (While the Here Ones each have a processor, right now Doppler relies on users’ smartphones to listen for noises.) With these two sets of audio, Doppler can accurately predict and counteract, or boost, real-world sounds for the buds, and then make those models work on the Here One.
999,999 crying babies
Unlike headphones and even most computers, the Here One itself isn’t the most important part of the user’s experience. What really matters are the algorithms running on the buds. It’s much like Tesla’s model for their self-driving Autopilot feature, which is consistently updated to perform better.
“We want to ship a piece of hardware, and utilize software to make it continuously better,” Kraft says. “The true north here is Tesla.”
Just like Tesla’s Autopilot captures data to make every other car better, Doppler’s algorithms will learn from whatever users are hearing. Every time someone with Here One earbuds activates a noise filter, the buds will capture a snapshot of data about the noises they hear—audio qualities like frequency, bandwidth, location, and time. Since the algorithms rely on these qualities, rather than the raw audio itself, the data furthers Doppler’s understanding of what the world sounds like.
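To make the difference between raw audio and a "snapshot" of audio qualities concrete, here is a minimal sketch that reduces a short frame of samples to two numbers: its strongest frequency and a rough bandwidth. It assumes nothing about Doppler's actual features or code. The `snapshot` function and its field names are hypothetical, and it uses a slow, naive Fourier transform for readability where a real system would use an FFT.

```python
import cmath
import math
from typing import Dict, List


def snapshot(samples: List[float], sample_rate: int) -> Dict[str, float]:
    """Summarize a short audio frame as features instead of keeping the audio."""
    n = len(samples)
    # Naive DFT power for bins 1 .. n/2 - 1, stored as (frequency_hz, power).
    power = []
    for k in range(1, n // 2):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        power.append((k * sample_rate / n, abs(acc) ** 2))

    total = sum(p for _, p in power)
    if not power or total == 0:
        # Silence (or a frame too short to analyze): nothing to describe.
        return {"peak_hz": 0.0, "bandwidth_hz": 0.0}

    # Frequency of the strongest bin.
    peak_hz = max(power, key=lambda fp: fp[1])[0]
    # Power-weighted mean frequency, and the spread around it as a bandwidth proxy.
    centroid = sum(f * p for f, p in power) / total
    spread = math.sqrt(sum(p * (f - centroid) ** 2 for f, p in power) / total)
    return {"peak_hz": peak_hz, "bandwidth_hz": spread}
```

Fed a pure 1,000 Hz tone, this returns a peak at 1,000 Hz and a bandwidth near zero. The original waveform cannot be rebuilt from those two numbers alone, which is the general idea behind sharing descriptions of sound rather than recordings.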
To illustrate the real-time usefulness of such data, Doppler’s machine-learning lead, Jacob Meacham, describes a scenario where two Here One users go to the same restaurant. One goes there regularly, turning down the volume by 20 decibels each time because the restaurant is loud. The other has never been there. Based on the first person’s experiences, the second Here One user will hear a prompt asking if they want to turn the noise down.
Says Lanman, “It’s like contributing to Google Maps or Waze.”
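The restaurant scenario can be sketched in a few lines. Two things are illustrated here: what a 20-decibel cut means for the signal (the standard amplitude conversion is 10^(dB/20), so -20 dB scales amplitude to one-tenth), and how past adjustments at a place might become a suggestion for a newcomer. The function names, the `min_visits` threshold, and the use of a median are all assumptions made for illustration, not a description of Doppler's system.

```python
from statistics import median
from typing import Dict, List, Optional


def db_to_gain(db: float) -> float:
    """Convert a decibel change to an amplitude multiplier."""
    return 10 ** (db / 20)


def suggest_adjustment(history: Dict[str, List[float]],
                       place: str,
                       min_visits: int = 3) -> Optional[float]:
    """Return a suggested dB change for a place, or None if there is too little data.

    `history` maps a (hypothetical) place identifier to the dB changes
    that earlier visitors applied there.
    """
    past = history.get(place, [])
    if len(past) < min_visits:
        return None  # not enough evidence to prompt anyone
    return median(past)
```

With a history of three visits at -20 dB each, a first-time visitor would be offered a 20-decibel reduction, which `db_to_gain` shows is a tenfold drop in amplitude. A place nobody has visited yields no prompt at all.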
Once Doppler has these products on the street, they’ll have a constant stream of detailed information about the way the world sounds. Not having enough data will cease to be an issue.
“If you’re the millionth customer, then there’s 999,999 other customers who have been collecting baby samples,” says Meacham. “So when you encounter a baby, it’s going to be great.”
The sound of silence
Meacham and his colleague, Matthew Sills, joined Doppler from Palantir, the secretive data-mining company contracted by governments and large corporations. Their work at Doppler, in addition to building models of sound, is to ensure that all this highly sensitive user data never sees the light of day.
The two engineers have worked to separate user information from the data itself, which is just incredibly long strings of numbers that describe certain sounds. During our conversation, they theorized about how they might crack their own system.
“You’d have to get access to the data, then the files that tell you how to put them together, and then the situation that the data was used in. Even then you wouldn’t get anything intelligible,” Meacham says.
And they’ve tried. The team recorded the sound of someone speaking, ran it through their system, and then tried to turn it back into understandable speech. Says Meacham, “It definitely wasn’t English.”
White whale
The Here One earbuds are expected to start shipping next month. For $299, you’ll be able to listen in a certain direction: point your hearing forward to cut out chatter at a busy restaurant, for example, or use “eavesdrop mode” to listen in behind you. You’ll also be able to cut out planes, trains, and automobiles during your daily commute, or turn the volume down on noisy coworkers.
But Doppler Labs’ real white whale is speaker detection: training the earbuds to identify a specific person’s voice and boost it above the din. Imagine your earbuds telling you that your baby is crying, before you can even hear it yourself. Or being able to mute someone entirely.
Lanman says Doppler will involve the user in those types of developments: “It would be creepy if the system just suddenly went, ‘Fritz, do you want to amplify or suppress your wife?’”
Among other challenges, speaker detection requires the AI to work with just a small amount of training data, which is still one of machine learning’s biggest obstacles.
If you have an Amazon Echo, or have trained Siri or Google Assistant to the sound of your voice, you’re familiar with this principle: Say a few very specific phrases, and the AI learns to recognize your voice with more accuracy. But while the Echo typically sits in a quiet home, Doppler’s AI would have to work in a vast range of circumstances: loud bars, concerts, and around a bunch of other people with potentially similar voices.
By bye buy
True speaker detection is far off, as is the automatic translation I experienced. On the hardware side, the processing power for translation is currently too much for the buds or a smartphone to handle. Both conversationalists would have to carry some sort of external language pack, a system that right now would need its own carry-on suitcase.
On the AI side, Doppler’s translation algorithms—though based on systems built by a third party that also builds systems for the CIA—are still working through basic language questions.
“By, bye, buy. Or in Spanish, si and sí with an accent on the ‘i’ are ‘if’ and ‘yes,’” says Jeff Baker, Doppler’s vice president of R&D. “How in the world do you tell those apart?”
That’s a seminal question for Doppler Labs. It’s the crux of their technology, and the raison d’être for earbuds capable of identifying, understanding, mollifying, or augmenting sound. Earbuds that could make sound personalized, and help build a model of how every single user prefers to hear the world.
“It gets to this really interesting philosophical sci-fi debate,” Lanman says. “Isn’t it kind of weird that everyone is going to have their own subjective reality for what the world sounds like?”