Despite facing increasingly difficult tests, artificial intelligence models have been advancing quickly and passing even PhD-level exams with high scores, making it hard to track just how good they’re getting. But it seems the AI models have met their match, at least for now.
According to the results of a new benchmark called “Humanity’s Last Exam” (HLE), top AI models from OpenAI, Google, and Anthropic aren’t quite “at the frontier of human knowledge” yet. The evaluation, developed by researchers at the Center for AI Safety (CAIS) and Scale AI, is “designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.” Basically, they claim it’s the most difficult test these AI models have ever faced.
The researchers evaluated several multimodal frontier models, including Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s GPT-4o and its new reasoning model, o1. All of the models scored less than 10% on HLE, much lower than on popular benchmarks such as Massive Multitask Language Understanding (MMLU) and Graduate-Level Google-Proof Q&A (GPQA).
“We wanted problems that would test the capabilities of the models at the frontier of human knowledge and reasoning,” Dan Hendrycks, co-founder and executive director of CAIS, said in a statement. “We can’t predict how quickly the models will advance... Right now, Humanity’s Last Exam shows that there are still some expert closed-ended questions that models are not able to answer. We will see how long that lasts.”
Hendrycks is also an advisor to Scale, with which he worked to compile the more than 3,000 multiple-choice and short-answer questions across more than 100 subjects. Questions included a request to translate a Palmyrene script and one about hummingbird anatomy.
The researchers received exam questions from close to “1,000 subject expert contributors” from around the world, whom they asked to submit the “toughest questions” they know. Prize money was offered for the top questions, and contributors whose questions were chosen were offered optional co-authorship.
While the current top AI models failed the HLE, “recent history shows benchmarks are quickly saturated — with models dramatically progressing from near-zero to near-perfect performance in a short timeframe,” the researchers said.
It’s “plausible” that the AI models could exceed 50% accuracy on the HLE by the end of the year, the researchers said. However, that alone wouldn’t “suggest autonomous research capabilities or ‘artificial general intelligence,’” they added, referring to the point at which AI systems are believed to have matched or exceeded human-level capabilities.
“HLE may be the last academic exam we need to give to models, but it is far from the last benchmark for AI,” the researchers said.