The secret to training AI might be knowing how to say “good job”

High-five for machine-human collaboration!
Image: Reuters/ Christinne Muschi

It’s tough to appreciate just how efficiently humans learn. From only a few experiences, we can figure out complex tasks like learning to walk or becoming pros at the office coffee machine (roughly of equal importance).

But we haven’t been able to give machines that same gift. Reinforcement learning, a promising sector of AI research where algorithms test different ways to accomplish a task until they can reliably get it right, is one method used to get machines to learn by doing.
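The trial-and-error loop at the heart of reinforcement learning is simple to sketch. Here's a minimal, hypothetical example (not from either paper): an agent with two possible actions gradually learns to favor the one that earns a reward.

```python
import random

# Toy reinforcement-learning loop: try actions, observe rewards,
# and shift toward the action that reliably pays off.
# Hypothetical setup: action 1 always earns a reward, action 0 never does.
values = [0.0, 0.0]   # the agent's estimated value of each action
alpha = 0.1           # learning rate
epsilon = 0.1         # chance of exploring a random action

random.seed(0)
for step in range(500):
    if random.random() < epsilon:
        action = random.randrange(2)        # explore something new
    else:
        action = values.index(max(values))  # exploit the best-known action
    reward = 1.0 if action == 1 else 0.0    # the environment's feedback
    # Nudge the estimate toward the observed reward
    values[action] += alpha * (reward - values[action])

print(values.index(max(values)))  # after enough trials: 1
```

Real systems replace the two-action table with neural networks and far richer environments, but the test-and-adjust cycle is the same.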

The field’s biggest problem: What’s the best way to tell AI it has done something right?

This week, two major Silicon Valley outfits published research trying to answer that question: a collaboration between Alphabet’s DeepMind and the Elon Musk-funded OpenAI, as well as separate work from Microsoft-owned Maluuba.

The two papers represent different perspectives on how machines of the future might learn. OpenAI and DeepMind’s work suggests that humans may be the best shepherds for fledgling AI, guiding the way it learns to ensure its safety. Maluuba takes a new look at an idea AI researchers have hammered at for years, trying to find a way for its algorithm to better understand its failures and successes without human intervention.

The DeepMind and OpenAI research, posted June 13, has humans watch two videos of a 3D figure trying to do a front flip. The human chooses the video where the algorithm made the better attempt—but there’s a secret! The algorithm has already tried to predict which attempt was better, so the human’s choice not only identifies the better attempt, it also trains the algorithm to predict which attempts humans will prefer.
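The mechanics of learning from those pairwise choices can be sketched in a few lines. This is a hypothetical, stripped-down version: the real paper trains a deep network as its reward predictor, while here the "reward model" is just a weighted sum of two made-up features, updated so the human-preferred attempt scores higher.

```python
import math

# Sketch of learning from pairwise human preferences: a reward model
# scores each attempt, and each human choice nudges the weights so the
# preferred attempt ranks higher (a logistic, Bradley-Terry-style update).
weights = [0.0, 0.0]  # toy reward model over two hypothetical features

def score(features):
    return sum(w * f for w, f in zip(weights, features))

def update(preferred, rejected, lr=0.5):
    # Probability the model already agrees with the human's choice
    p = 1.0 / (1.0 + math.exp(score(rejected) - score(preferred)))
    # Gradient step: push the preferred attempt's score upward
    for i in range(len(weights)):
        weights[i] += lr * (1.0 - p) * (preferred[i] - rejected[i])

# Toy data: the second feature stands in for "completed the flip"
good_flip, bad_flip = [1.0, 1.0], [1.0, 0.0]
for _ in range(20):
    update(good_flip, bad_flip)  # the human keeps picking the good flip

print(score(good_flip) > score(bad_flip))  # True: the model now agrees
```

Once such a model reliably predicts human preferences, it can hand out rewards on its own, letting the flipping algorithm practice far more attempts than a human could ever judge.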

Much reinforcement learning research from DeepMind and OpenAI in the past has focused on video games, where there’s a clear goal: Get more points. This new research has an objective goal (do a front flip), but the human judgment can be subjective. OpenAI researchers say this idea could improve AI safety, because future algorithms would be able to align themselves with what humans think are correct and safe behaviors.

Microsoft’s Maluuba took a different approach to reinforcement learning, using it to beat the game Ms. Pac-Man, according to research published June 14. The team quadrupled the previous high score on the game (by human or machine), achieving the maximum number of points possible.

When the agent (Ms. Pac-Man) starts to learn, it moves randomly; it knows nothing about the game board. As it discovers new rewards (the little pellets and fruit Ms. Pac-Man eats), it assigns a small algorithm to each of those spots, and each one continuously learns, from Ms. Pac-Man’s interactions, how best to avoid ghosts and get more points, according to the Maluuba research paper.

Once mapped, the 163 helper algorithms each continually send the agent the move they think will generate the highest reward; the agent averages those inputs and moves Ms. Pac-Man. Each time the agent dies, all of the algorithms process what generated rewards. These helper algorithms were, however, carefully crafted by humans to understand how to learn.
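The averaging step described above can be sketched directly. The names and scores here are invented for illustration: each "helper" reports a preferred score for every move, and the agent simply averages them before moving.

```python
# Hypothetical sketch of aggregating many helpers' move preferences:
# each helper scores every possible move, and the agent averages the
# scores and takes the move with the highest average.
MOVES = ["up", "down", "left", "right"]

def pick_move(helper_scores):
    """helper_scores: a list of {move: score} dicts, one per helper."""
    averaged = {
        move: sum(h[move] for h in helper_scores) / len(helper_scores)
        for move in MOVES
    }
    return max(averaged, key=averaged.get)

# Two made-up helpers: one chases a pellet to the left, one flees a
# ghost approaching from above.
pellet = {"up": 0.1, "down": 0.2, "left": 0.9, "right": 0.1}
ghost = {"up": -1.0, "down": 0.6, "left": 0.5, "right": 0.2}
print(pick_move([pellet, ghost]))  # "left": best compromise on average
```

With 163 helpers instead of two, the principle is the same: no single algorithm understands the whole board, but their averaged votes produce sensible play.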

Instead of having one algorithm learn one complex problem, the AI distributes learning over many smaller algorithms, each tackling simpler problems, Maluuba says in a video. This research could be applied to other highly complex problems, like financial trading, according to the company.

But it’s worth noting that since more than 100 algorithms are being used to tell Ms. Pac-Man where to move and win the game, this technique is likely to be extremely computationally intensive, so it’s probably not ready for the Microsoft production line any time soon.