What do you need to build a self-driving car? Roboticists and computer scientists have generally settled on similar requirements. Your autonomous vehicle needs to know where the boundaries of the road are. It needs to be able to steer the car and hit the brakes. It needs to know the speed limit, be able to read street signs, and detect if a traffic light is red or green. It needs to be able to react quickly to unexpected objects in its path, and it gets extra points if it knows where it is on a map.
All of those skills are important and necessary. But by building from a list of technical requirements, researchers neglect the single most important part of real-world driving: our intuition. Using it to determine the motivations of those around us is something humans are so effortlessly good at that it’s hard to even notice we’re doing it, let alone program for it.
A self-driving car currently lacks the ability to look at a person—whether they’re walking, driving a car, or riding a bike—and know what they’re thinking. These instantaneous human judgments are vital to our safety when we’re driving—and to that of others on the road, too.
As the CTO and cofounder of Perceptive Automata, an autonomous-vehicle software company started by Harvard neuroscientists and computer scientists, I wanted to see how often humans make these kinds of subconscious calls on the road. I took a camera out to a calm intersection near my former lab at Harvard with no traffic signals. It is not by any stretch of the imagination as congested or difficult as an intersection in downtown Boston, let alone Manhattan or Mexico City. But in 30 seconds of video, it is still possible to count more than 45 instances of one person intuiting what’s in the mind of another. These non-verbal, split-second intuitions could be “that person is not going to yield,” “that person doesn’t know I’m here,” or “that person wouldn’t jaywalk while walking a dog.” Is that bicyclist going to turn left or stop? Is that pedestrian going to take advantage of their right-of-way and cross? These judgments happen instantaneously.
We have lots of empirical evidence that humans are incredibly good at intuiting the intentions of others. The Sally-Anne task is a classic psychology experiment. Subjects—usually children—watch a researcher acting out a scene with dolls. A doll named Sally hides a marble in a covered basket. Sally leaves the room. While Sally is gone, a second doll—Anne—secretly moves the marble out of the basket and into a closed box. When the first doll comes back, children are asked where she will look for the marble. It’s easy to say, “Well, of course she’ll still look in the basket,” as Sally couldn’t have known that the marble had moved while she was gone. But that “of course” is hiding an immensely sophisticated model. Children have to know not only that Sally is aware of some things and not of others, but that her awareness only updates when she is able to pay attention to something. They also have to know that her mental state is persistent, even when she leaves the room and comes back. This task has been repeated many times in labs around the world, and is part of the standard toolkit researchers use to understand if somebody’s social intuitions are intact.
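To make the bookkeeping behind that “of course” concrete, here is a minimal sketch in Python of the two rules the task relies on: an observer’s belief updates only while she is present, and it persists unchanged while she is away. The `Agent` and `World` classes and their method names are hypothetical, invented purely for illustration; this is not a model taken from the psychology literature or from any self-driving system.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """An observer whose beliefs change only while it is in the room."""
    name: str
    present: bool = True
    beliefs: dict = field(default_factory=dict)  # object -> believed location


class World:
    def __init__(self, agents):
        self.agents = {a.name: a for a in agents}
        self.locations = {}  # object -> true location

    def move(self, obj, place):
        """Move an object; only agents who are present see it happen."""
        self.locations[obj] = place
        for agent in self.agents.values():
            if agent.present:
                agent.beliefs[obj] = place

    def leave(self, name):
        self.agents[name].present = False

    def enter(self, name):
        # Coming back does not refresh beliefs; they persist from before.
        self.agents[name].present = True

    def where_will_look(self, name, obj):
        return self.agents[name].beliefs.get(obj)


world = World([Agent("Sally"), Agent("Anne")])
world.move("marble", "basket")  # Sally hides the marble; both dolls see it
world.leave("Sally")
world.move("marble", "box")     # Anne moves it while Sally is away
world.enter("Sally")

print(world.where_will_look("Sally", "marble"))  # basket (Sally's stale belief)
print(world.locations["marble"])                 # box (where it really is)
```

Even this toy version has to keep the true state of the world separate from each observer’s picture of it, which is exactly the distinction children draw without being told to.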
The ability to predict the mental state of others is so innate that we even apply it to distinctly non-human objects. The Heider-Simmel experiment shows how we’re prone to ascribe perceived intent even to simple geometric shapes. In this famous study, a film shows two triangles and a circle moving around the screen. Almost without exception, viewers construct an elaborate narrative about the goals and interactions of the geometric shapes: One is a villain, one a protector, the third a victim who grows courageous and saves the day. All of these mental states and narratives arise just from watching geometric shapes move about. In the psychological literature, this is called an “impoverished stimulus.”
Our interactions with people using the road are an example of an impoverished stimulus, too. We only see a pedestrian for a few hundred milliseconds before we have to decide how to react to them. We see a car edging slightly into a lane for a half second and have to decide whether to yield to them. We catch a fleeting glimpse of a cyclist and judge whether they know we’re making a right turn. These kinds of interactions are constant, and they are at the very core of driving safely and considerately.
And computers, so far, are hopeless at navigating them.
The perils of lacking an intuition for state of mind are already evident. In the first at-fault crash of a self-driving vehicle, a Google self-driving car in Mountain View incorrectly assumed that a bus driver would yield to it, misunderstanding both the urgency and the flexibility of a human driver trying to get around a stopped vehicle. In another crash, a self-driving Uber in Arizona was hit by a turning driver who expected that any oncoming vehicles would notice the adjacent lanes of traffic had slowed down and adjust their expectations of how turning drivers would behave.
Why are computers so bad at this task of mind reading if it’s so easy for people? This circumstance comes up so often in AI development that it has a name: “Moravec’s Paradox.” The tasks that are easiest for people are often the ones that are the hardest for computers. “We’re least aware of what our minds do best,” said the late AI pioneer Marvin Minsky. “We’re more aware of simple processes that don’t work well than of complex ones that work flawlessly.”
So how do you design an algorithm to perform a task if you can’t say with any certainty what the task entails?
The usual solution is to define the task as simply as possible and use what are called deep-learning algorithms that can learn from vast quantities of data. For example, when given a sufficient number of pictures of trees (and pictures of things that are not trees), these computer programs can do a very good job of identifying a tree. If you boil a problem down to either proving or disproving an unambiguous fact about the world—there is a tree there, or there is not—algorithms can do a pretty good job.
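As a rough illustration of that recipe, the sketch below learns a tree-or-not-tree decision from labeled examples. It is deliberately not deep learning: to stay small and dependency-free it swaps in plain logistic regression, and it assumes each picture has already been reduced to two made-up numbers (`greenness` and `trunk_likeness`). The data, the feature names, and the function names are all hypothetical.

```python
import math

# Toy, made-up data: each "picture" is two hypothetical features,
# (greenness, trunk_likeness), with label 1 = tree and 0 = not a tree.
examples = [
    ((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.7, 0.7), 1), ((0.95, 0.6), 1),
    ((0.2, 0.1), 0), ((0.1, 0.3), 0), ((0.3, 0.2), 0), ((0.4, 0.1), 0),
]


def predict_proba(weights, bias, features):
    """Estimated probability that the features describe a tree."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(data, epochs=2000, lr=0.5):
    """Fit the weights by stochastic gradient descent on the log loss."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in data:
            error = predict_proba(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


weights, bias = train(examples)
is_tree = predict_proba(weights, bias, (0.85, 0.75)) > 0.5
not_tree = predict_proba(weights, bias, (0.15, 0.2)) > 0.5
print(is_tree, not_tree)  # True False
```

The approach works here because every example carries an unambiguous label. Nothing in the loop would know what to do if the right answer were something like “that person probably doesn’t know I’m here.”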
But what to do about problems where basic facts about the world are neither simple nor accessible? Humans can make surprisingly accurate judgments about other humans because we have an immensely sophisticated set of internal models for how those around us behave. But those models are hidden from scrutiny, hidden in the black boxes of our minds. How do you label images with the contents of somebody’s constantly fluid and mostly nonsensical inner monologue?
The only way to solve these problems is to deeply understand human behavior—not just by reverse-engineering it, but by characterizing it carefully and comprehensively using the techniques of behavioral science. Humans are immensely capable but have opaque internal mechanisms. We need to use the techniques of human behavioral research in order to build computer-vision models that are trained to capture the nuances and subtleties of human responses to the world instead of trying to guess what our internal model of the world looks like.
First, we need to work out how humans work—second comes training the machines. Only with a rich, deep characterization of the quirks and foibles of human ability can we know enough about the problem we’re trying to solve in order to build computer models that can solve it. By using humans as the model for ideal performance, we are able to gain traction on these difficult tasks and find a meaningful solution to this intuition problem.
And we need to solve it. If self-driving cars are going to achieve their promise as a revolution in urban transportation—delivering reduced emissions, better mobility, and safer streets—they will have to exist on a level playing field with the humans who already use those roads. They will have to be good citizens, not only skilled at avoiding at-fault accidents, but able to drive in such a way that their behavior is expected, comprehensible, and clear to other vehicles’ drivers and the pedestrians and cyclists sharing space with them.