MIT and Google researchers have made AI that can link sound, sight, and text to understand the world

If we ever want future robots to do our bidding, they’ll have to understand the world around them in a complete way—if a robot hears a barking noise, what’s making it? What does a dog look like, and what do dogs need?

By Dave Gershgorn. Updated July 20, 2022.

AI research has typically treated the ability to recognize images, identify noises, and understand text as three different problems, and built algorithms suited to each individual task. Imagine if you could only use one sense at a time, and couldn’t match anything you heard to anything you saw. That’s AI today, and part of the reason why we’re so far from creating an algorithm that can learn like a human. But two new papers from MIT and Google explain first steps for making AI see, hear, and read in a holistic way, an approach that could upend how we teach our machines about the world.

That word Aytar uses—aligned—is the key idea here. Researchers aren’t teaching the algorithms anything new, but instead creating a way for them to link, or align, knowledge from one sense to another. Aytar offers the example of a self-driving car hearing an ambulance before it sees it. The knowledge of what an ambulance sounds like, looks like, and its function could allow the self-driving car to prepare for other cars around it to slow down, and move out of the way.

To train this system, the MIT group first showed the neural network video frames that were associated with audio. After the network found the objects in the video and the sounds in the audio, it tried to predict which objects correlated to which sounds. At what point, for instance, do waves make a sound?

Next, the team fed images with captions showing similar situations into the same algorithm, so it could associate words with the objects and actions pictured. Same idea: first the network separately identified all the objects it could find in the pictures, and the relevant words, and then matched them.

The network might not seem incredibly impressive from that description. After all, we have AI that can do those things separately. But once trained on audio/image pairs and image/text pairs, the system was able to match audio to text, even though it had never been trained to know which words correspond to which sounds. The researchers say this indicates the network had built a more objective idea of what it was seeing, hearing, or reading, one that didn’t entirely rely on the medium it used to learn the information.
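For readers who want to see that transfer in miniature, here is a toy sketch in Python. It is not the MIT group's network: it swaps in plain least-squares linear maps for neural networks, and every name, dimension, and noise level in it is an assumption made up for illustration. Each hidden "concept" is observed as a noisy image, sound, and text vector. One map is fit on sound/image pairs, another on text/image pairs, and sounds are then matched to captions without the code ever seeing a sound/text pair.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, IMG, AUD, TXT = 8, 16, 12, 14  # illustrative sizes, not from either paper

# Hypothetical "senses": each one sees a different linear view of a hidden concept.
to_image = rng.normal(size=(LATENT, IMG))
to_audio = rng.normal(size=(LATENT, AUD))
to_text = rng.normal(size=(LATENT, TXT))


def observe(concepts, view, noise=0.01):
    """Return a slightly noisy observation of each concept through one sense."""
    clean = concepts @ view
    return clean + noise * rng.normal(size=clean.shape)


def normalize(rows):
    """Scale each row to unit length so dot products are cosine similarities."""
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)


# Training set A: sound paired with images. Training set B: text paired with images.
# The two sets use different concepts, and sound is never paired with text.
concepts_a = rng.normal(size=(200, LATENT))
concepts_b = rng.normal(size=(200, LATENT))
audio_a, image_a = observe(concepts_a, to_audio), observe(concepts_a, to_image)
text_b, image_b = observe(concepts_b, to_text), observe(concepts_b, to_image)

# Align each sense to the image space, which serves as the shared space here.
audio_to_shared, *_ = np.linalg.lstsq(audio_a, image_a, rcond=None)
text_to_shared, *_ = np.linalg.lstsq(text_b, image_b, rcond=None)

# Five new concepts, observed only as sound and as text.
new_concepts = rng.normal(size=(5, LATENT))
new_audio = observe(new_concepts, to_audio)
new_text = observe(new_concepts, to_text)

# Compare sounds and captions inside the shared space.
similarity = normalize(new_audio @ audio_to_shared) @ normalize(new_text @ text_to_shared).T
matches = similarity.argmax(axis=1)  # best caption index for each sound
print(matches)
```

If the alignment holds, each sound should pick the caption with the same index, because both were mapped into the same image-based space. The real systems learn far richer, nonlinear representations from video, audio, and captions, so this only illustrates the principle of matching through a shared space.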

An algorithm that can align its idea of an object across sight, sound, and text can automatically transfer what it has learned from what it hears to what it sees. Aytar offers an example: if the algorithm hears a zebra braying, it assumes that a zebra is similar to a horse.

“It knows that [the zebra] is an animal, it knows that it generates these kinds of sounds, and kind of inherently it transfers this information across modalities,” Aytar says. These kinds of assumptions allow the algorithm to make new connections between ideas, strengthening its understanding of the world.

Google’s model behaves similarly, except that it can also translate text. Google declined to provide a researcher to talk more about how its network operated. However, the algorithm has been made available online to other researchers.

Neither the Google technique nor the MIT one actually performed better than single-use algorithms, but Aytar says that this won’t be the case for long.

“If you have more senses, you have more accuracy,” he said.