Skip to navigationSkip to content
Screenshot/ Google
Fodder for mechanical brains.
FILM SCHOOL

Google is teaching its AI how humans hug, cook, and fight

By Dave Gershgorn

The artificial intelligence that will power future robots and content filters now has a new resource for understanding humans.

Google, which owns YouTube, announced on Oct. 19 a new dataset of film clips, designed to teach machines how humans move in the world. Called AVA, or “atomic visual actions,” the videos aren’t anything special to human eyes—they’re three second clips of people drinking water and cooking curated from YouTube. But each clip is bundled with a file that outlines the person that a machine learning algorithm should watch, as well as a description of their pose, and whether they’re interacting with another human or object. It’s the digital version of pointing at a dog with a child and coaching them by saying, “dog.”

When more than one person is doing something in a video, each person has their own label. That way, the algorithm can learn that two people are needed to shake hands.

Google
Labels in Google’s dataset.

This technology could help Google to analyze the years of video it processes on YouTube every day. It could be applied to better target advertising based on whether you’re watching a video of people talk or fight, or in content moderation. The eventual goal is to teach computers social visual intelligence, the authors write in an accompanying research paper, which means “understanding what humans are doing, what might they do next, and what they are trying to achieve.”

The AVA dataset has 57,600 labelled videos that detail 80 actions. Simple actions like standing, talking, listening and walking are represented more in the dataset, with more than 10,000 labels each. The team says in a research paper that using clips from movies does introduce some bias into their work, since there is a “grammar” to filmmaking and some actions are dramatized.

“It is not that we regard this data as perfect,” the researchers write in the associated paper. ” Just that it is better than working with the assortment of user-generated content such as videos of animal tricks, DIY instructional videos, events such as children’s birthday parties, and the like.”

The paper cites trying to find “top actors of various nationalities,” but does not detail in which ways the dataset might be biased by race or gender, as DeepMind did in a similar video dataset.

The AVA dataset is available now to peruse for yourself.