Digital assistants don’t know when to can it. If you’ve asked Siri or Alexa a question, you’ll know the pain of getting a response loosely tied to some Wikipedia article that doesn’t really answer your original question.
A new dataset from Stanford aims to teach AI systems to answer questions more effectively by recognizing when there isn’t enough information to provide an accurate answer. The dataset is called SQuAD 2.0, short for the Stanford Question Answering Dataset. It’s an update of an earlier, wildly popular dataset used by companies like Microsoft, Google, and Alibaba to show off how accurate their language-understanding AI systems are at answering questions.
Before we get into SQuAD 2.0, a quick primer on how AI is trained. Deep learning, a modern flavor of AI used by every big tech company today, is basically complex pattern-recognition software. Given enough data, like sentences or images, it can find the pattern of words or pixels associated with an idea. In a dataset of sentences about the animal world, the sentence “birds have wings and feathers” helps form the AI’s concept of a bird: after seeing thousands of sentences about other kinds of animals, the AI picks up the pattern that “bird” is the subject of the sentence and “wings” and “feathers” are its attributes. Datasets like Stanford’s are the raw fuel for teaching AI systems about the world.
Datasets before SQuAD 2.0, including its first version, all worked by giving the algorithm being trained a paragraph of text and then asking it a few questions about that text. But those datasets typically shared one underlying assumption: that the answer actually exists in the text. An algorithm trained on SQuAD 2.0 must decide not only how to answer a question correctly but also whether it can be answered at all.
“If you ask [Google] ‘who’s the current emperor of China,’ it’s a question where there’s no answer, because there is no emperor anymore. But Google will actually give you the last emperor of China,” Robin Jia, co-author of the paper, told Quartz. “It’s easy for these systems to give an answer that ends up being quite misleading.”
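To make the answer-or-abstain idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the Stanford team’s code: the record layout, the function name answer_or_abstain, and the scores passed to it are all hypothetical, standing in for whatever a real model would produce. The sketch returns nothing when the model’s “no answer” score beats its best candidate answer.

```python
from typing import Optional

# Hypothetical, simplified records in the spirit of the dataset:
# a passage, a question, and a list of acceptable answers.
# An empty list marks a question the passage cannot answer.
EXAMPLES = [
    {
        "context": "Puyi was the last emperor of China. He abdicated in 1912.",
        "question": "Who was the last emperor of China?",
        "answers": ["Puyi"],
    },
    {
        "context": "Puyi was the last emperor of China. He abdicated in 1912.",
        "question": "Who is the current emperor of China?",
        "answers": [],  # unanswerable: there is no current emperor
    },
]


def answer_or_abstain(
    best_span: str,
    span_score: float,
    no_answer_score: float,
    threshold: float = 0.0,
) -> Optional[str]:
    """Return the candidate answer, or None to abstain.

    The scores are assumed to come from some trained model (not shown).
    The model abstains when "no answer" outscores the best candidate
    by more than `threshold`.
    """
    if no_answer_score - span_score > threshold:
        return None
    return best_span


# Made-up scores, for illustration only.
confident = answer_or_abstain("Puyi", span_score=0.9, no_answer_score=0.1)
doubtful = answer_or_abstain("Puyi", span_score=0.2, no_answer_score=0.8)
```

With these made-up scores, the first call returns “Puyi” and the second returns None, the behavior Jia describes wanting for the emperor question. Raising the threshold makes the sketch more willing to guess, and lowering it makes it more cautious.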
The new dataset includes nearly 50,000 questions that are unanswerable but purposely crafted to relate loosely to the subject matter of the reference text. That’s not to say that AI systems have proven very good at answering the dataset’s trick questions yet. The Stanford researchers’ first crack at training a question-answering algorithm on the dataset scored 66%, 20 points lower than on the previous iteration of SQuAD, because the AI keeps trying to answer the unanswerable questions. (For comparison, algorithms trained on the original version of SQuAD scored only 51% when it was first released.) Now that the dataset is published, other researchers will be able to train their own algorithms on it and figure out better and better ways to make their AI systems answer questions.
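A rough sketch shows why a system that always answers loses points on a dataset like this. The scoring below is a simplified exact-match check written for illustration, not the official evaluation, and the toy questions, function names, and results are made up, with no connection to the figures above. The one rule that matters: when a question has no acceptable answer, the only correct response is to give none.

```python
from typing import List, Optional


def exact_match(prediction: Optional[str], gold_answers: List[str]) -> bool:
    """Simplified exact-match check (an assumption, not the official metric).

    For an unanswerable question (empty gold list), only abstaining
    (prediction is None) counts as correct.
    """
    if not gold_answers:
        return prediction is None
    if prediction is None:
        return False
    normalized_gold = {g.strip().lower() for g in gold_answers}
    return prediction.strip().lower() in normalized_gold


def accuracy(predictions: List[Optional[str]], golds: List[List[str]]) -> float:
    """Fraction of questions scored correct under exact_match."""
    correct = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)


# Toy gold answers: two answerable questions, two unanswerable ones.
golds = [["Puyi"], ["1912"], [], []]

# A hypothetical system that always guesses something.
always_answers = ["Puyi", "1912", "Puyi", "1912"]

# A hypothetical system that abstains on the unanswerable questions.
knows_when_to_stop = ["Puyi", "1912", None, None]

always_score = accuracy(always_answers, golds)
abstain_score = accuracy(knows_when_to_stop, golds)
```

On this toy set, the always-answering system gets every answerable question right and still scores only half, because each guess on an unanswerable question counts as a miss.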
“We’re trying to do something that’s challenging but also manageable,” Jia said, mentioning that other datasets in the past have actually taken text and questions straight from reading comprehension tests meant for human students. But Jia says that those prompts often rely on an outside understanding of the world, like the implication of a character’s motivation in a story.
Jia says that the team also decided to publish a webpage where the paper’s findings can be demonstrated, as a part of a larger effort in the research group to make their findings reproducible.