Friday, March 6, 2026

Humanity's Last Exam -- New Benchmark Test To Test Accuracy Of Chatbots -- March 6, 2026

Locator: 50154AI.

Link here

Humanity's Last Exam (HLE) is a language model benchmark consisting of 2,500 questions across a broad range of subjects. 

It was created jointly by the Center for AI Safety and Scale AI. 

Stanford cites Humanity's Last Exam as "one of the more challenging benchmarks" developed in response to the popular AI benchmarks having reached "saturation".

The test has been described as the brainchild of Dan Hendrycks, a machine learning researcher and the director of the Center for AI Safety in San Francisco. Hendrycks stated that he was inspired to create the test after a conversation with Elon Musk, who thought the existing language model benchmarks, such as MMLU, were too easy.

Hendrycks worked with Scale AI to compile the questions. The questions were crowdsourced from subject matter experts from various institutions across the world.

Sample question: 

Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

Put that in your pipe and smoke it! Better yet, put it in your top three favorite chatbots and see how they do.  

The answer to the question, by the way, is four (4): one pair of tendons on each side, so two pairs x two tendons per pair.

The question is a bit ambiguous -- one must read it carefully. Is it asking for the number of pairs or the total number of tendons? The possible readings:

  • total number of tendons: 4
  • number of pairs: 2
  • number of tendons in each pair: 2
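
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the three readings. The constants come from the breakdown above; the names (PAIRS, TENDONS_PER_PAIR, readings, matching_readings) are made up for illustration and are not part of the benchmark or any grading tool.

```python
# Figures taken from the breakdown above (assumed, not from the benchmark itself).
PAIRS = 2              # one pair of tendons on each side
TENDONS_PER_PAIR = 2   # two tendons in each pair

# The three ways the question could be read, with the number each one yields.
readings = {
    "total number of tendons": PAIRS * TENDONS_PER_PAIR,
    "number of pairs": PAIRS,
    "number of tendons in each pair": TENDONS_PER_PAIR,
}


def matching_readings(reply: int) -> list[str]:
    """Return every reading of the question that a numeric reply would satisfy."""
    return [name for name, value in readings.items() if value == reply]


for name, value in readings.items():
    print(f"{name}: {value}")
```

A chatbot that replies "2" matches two of the three readings, so a bare number alone does not show which reading it had in mind.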

Interestingly, when asked for an image / graphic of this bone, Google Gemini said such an image is not available through typical search engines because they don't search the highly detailed, academic, peer-reviewed sources that would be required to find one.

AI has a long, long way to go. That seems like a fairly basic question for AI?