Researchers just stumped AI with their most difficult test — but for how long?

A new AI benchmark called "Humanity's Last Exam" stumped top models

Image: J Studios (Getty Images)
Despite facing increasingly difficult tests, artificial intelligence models have been advancing quickly, passing even PhD-level exams with high scores and making it hard to track just how capable they are becoming. But it seems the AI models have met their match, at least for now.

According to the results of a new benchmark called “Humanity’s Last Exam” (HLE), top AI models from OpenAI, Google, and Anthropic aren’t quite “at the frontier of human knowledge” yet. The evaluation, developed by researchers at the Center for AI Safety (CAIS) and Scale AI, is “designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.” Basically, they claim it’s the most difficult test these AI models have ever faced.


The researchers evaluated several multimodal frontier models, including Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s GPT-4o and its new reasoning model, o1. All of the models scored less than 10% on HLE, far lower than they score on popular benchmarks such as Massive Multitask Language Understanding (MMLU) and the graduate-level Google-Proof Q&A (GPQA).


“We wanted problems that would test the capabilities of the models at the frontier of human knowledge and reasoning,” Dan Hendrycks, co-founder and executive director of CAIS, said in a statement. “We can’t predict how quickly the models will advance... Right now, Humanity’s Last Exam shows that there are still some expert closed-ended questions that models are not able to answer. We will see how long that lasts.”


Hendrycks is also an advisor to Scale, which he worked with to compile the more than 3,000 multiple-choice and short-answer questions spanning more than 100 subjects. The questions range from translating Palmyrene script to hummingbird anatomy.

The researchers received exam questions from nearly “1,000 subject expert contributors” around the world, who were asked to submit the “toughest questions” they knew. Prize money was offered for the top questions, and contributors whose questions were chosen were offered optional co-authorship.


While the current top AI models failed the HLE, “recent history shows benchmarks are quickly saturated — with models dramatically progressing from near-zero to near-perfect performance in a short timeframe,” the researchers said.

It’s “plausible” that AI models could exceed 50% accuracy on the HLE by the end of the year, the researchers said. However, that alone wouldn’t “suggest autonomous research capabilities or ‘artificial general intelligence,’” they added, referring to the point at which AI systems are believed to meet or exceed human-level capabilities.


“HLE may be the last academic exam we need to give to models, but it is far from the last benchmark for AI,” the researchers said.