
AI's Hidden Challenge: Humanity's Last Exam

Published December 27, 2024

In San Francisco, two major players in artificial intelligence have issued a challenge to the public. They are looking for questions that can truly test the abilities of large language models (LLMs) like Google Gemini and OpenAI’s o1. Scale AI, a company specializing in preparing extensive datasets used to train LLMs, has partnered with the Center for AI Safety (CAIS) to create a new initiative titled Humanity’s Last Exam.

The initiative offers US$5,000 (£3,800) in prizes to the people who come up with the top 50 questions for the test. Scale and CAIS say the goal is to gauge how close we are to building “expert-level AI systems” by assembling “the largest, broadest coalition of experts in history”.

Why is this important? The leading LLMs are already showing impressive results in various established tests covering intelligence, mathematics, and law. However, it is difficult to determine how significant these results are. Many of these models may have essentially learned the answers beforehand, given the immense amounts of data they are trained on, which includes a considerable part of the internet.

Data is central to the shift from conventional computing to artificial intelligence, which relies on “showing” machines what to do rather than simply “telling” them. Achieving this demands not only robust training datasets but also effective testing methods. Developers usually evaluate models on data that was held back from training, known as test datasets.
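In code, the idea looks something like the minimal sketch below, which holds out part of a small public dataset purely for evaluation. It uses scikit-learn and a simple classifier as a stand-in; LLMs are trained and evaluated at vastly larger scale, but the train/test distinction is the same.

```python
# Minimal sketch of held-out evaluation: the model never sees the test
# split during training, so its test score reflects generalization
# rather than memorization.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Keep 25% of the data aside purely for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)              # "showing" the machine examples
print("held-out accuracy:", model.score(X_test, y_test))
```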

As LLMs improve, it looks increasingly likely that they will ace established tests such as bar exams. Research firm Epoch AI estimates that by 2028, AIs will effectively have read everything that humans have ever written. A significant challenge will be how to keep assessing them once that point is reached.

Moreover, the internet is expanding all the time, with millions of new items added every day. Could that growth solve the testing problem? Perhaps, but it feeds into another creeping issue: “model collapse”. As the internet becomes inundated with AI-generated content, and that content is fed back into future training runs, AIs may begin to perform worse. To combat this, many developers are already collecting data from their AIs’ interactions with humans to keep their training and testing sets relevant.
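The mechanism can be illustrated with a deliberately crude simulation. The toy “model” below just fits a mean and spread to its data and, like many generative models, under-produces rare content; retraining it on its own output each generation steadily erodes the diversity of the original data. Real model collapse in LLMs is subtler, but the feedback loop is the same.

```python
# Toy illustration of "model collapse": each generation is trained only on
# the previous generation's output. Because the toy model under-produces
# rare (tail) content, diversity (measured here as standard deviation)
# shrinks generation after generation.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(0.0, 1.0, size=10_000)      # generation 0: "human" data

for gen in range(6):
    mu, sigma = data.mean(), data.std()
    print(f"generation {gen}: spread (std) = {sigma:.2f}")
    samples = rng.normal(mu, sigma, size=10_000)       # the model's output
    # The model rarely reproduces unusual content: drop the tails.
    data = samples[np.abs(samples - mu) < 1.5 * sigma]
```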

Future of AI Testing

Some experts argue that AIs should become “embodied”: moving around in real-world environments and gathering their own experiences, much as humans do. While this may sound unrealistic, companies like Tesla have in effect been doing it for years with their cars. Additionally, wearable devices such as the popular Ray-Ban Meta smart glasses, equipped with cameras and microphones, can gather extensive human-centric video and audio data.

However, even if such products supply future training data, there is still the question of how to define and measure intelligence, particularly artificial general intelligence (AGI), meaning AI that matches or exceeds human intelligence. Traditional human IQ tests have long been criticized for failing to capture the many facets of intelligence, such as empathy, mathematical skill and spatial ability.

The tests used for AIs face a similar problem. There are many well-established tests for specific tasks, including summarizing text, drawing inferences, and recognizing human gestures. Some of these tests are being retired because AIs now do so well at them, but they are too narrow in focus to serve as measures of overall intelligence. Stockfish, the chess-playing AI, for example, is far ahead of Magnus Carlsen, the highest-rated human player in history, yet it is incapable of other tasks such as understanding language. Clearly, proficiency at such specialized tasks should not be equated with broader intelligence.

As AIs start demonstrating wider intelligent behavior, the challenge becomes developing new standards for assessing and tracking their growth. A notable approach comes from François Chollet, a French engineer at Google. He posits that true intelligence is the ability to adapt and apply knowledge to new situations. In 2019, he introduced the Abstraction and Reasoning Corpus (ARC), a set of puzzles designed to test an AI's capacity to infer and apply abstract reasoning rules.

Unlike benchmarks that feed AIs millions of labeled examples to learn from, ARC gives only a handful of examples per puzzle. The AI must deduce the puzzle's logic for itself rather than memorize every possible answer. The puzzles are not especially difficult for humans, yet a prize of US$600,000 awaits the first AI system to score 85%. At the time of writing, the leading LLMs, OpenAI's o1-preview and Anthropic's Claude 3.5 Sonnet, score only about 21% on the public ARC leaderboard.
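To make the format concrete, each ARC task is a small JSON record with a few “train” input/output grid pairs and one or more “test” inputs, where grids are just lists of small integers. The toy task below is invented for illustration (it is not from the official dataset), and the “solver” is nothing more than a single hand-written rule checked against the training pairs.

```python
# A made-up task in the ARC format: a few "train" pairs from which a rule
# must be inferred, plus a "test" input to apply it to. The hidden rule
# here is simply "flip the grid horizontally".
toy_task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 6, 0], [0, 7, 8]], "output": [[0, 6, 5], [8, 7, 0]]},
    ],
    "test": [{"input": [[4, 0, 9]]}],
}

def flip_horizontal(grid):
    """Candidate rule: mirror each row left to right."""
    return [list(reversed(row)) for row in grid]

# A solver must check its hypothesis against every training pair...
assert all(flip_horizontal(p["input"]) == p["output"] for p in toy_task["train"])

# ...and only then apply it to the unseen test input.
print(flip_horizontal(toy_task["test"][0]["input"]))   # [[9, 0, 4]]
```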

Another recent attempt, using OpenAI’s GPT-4o, scored 50%, but somewhat controversially: the approach generated thousands of candidate solutions before settling on the one judged the best fit. Even so, the result remains well below the human benchmark of over 90%.
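That strategy amounts to generating many candidates and keeping only those consistent with the worked examples. The sketch below illustrates the idea in miniature; instead of sampling thousands of programs from an LLM, it enumerates three hand-written candidate rules (hypothetical stand-ins), filters them against the training pairs, and applies whatever survives to the test input.

```python
# Sketch of "generate many candidates, then select": keep only the rules
# that reproduce every training pair, then apply a survivor to the test
# input. A real attempt would sample thousands of programs from an LLM;
# here the candidates are written by hand for illustration.
train_pairs = [
    ([[1, 2], [3, 4]], [[3, 4], [1, 2]]),   # hidden rule: flip vertically
    ([[0, 5]], [[0, 5]]),
]
test_input = [[7, 8], [9, 0]]

candidates = {
    "identity":        lambda g: [row[:] for row in g],
    "flip_horizontal": lambda g: [list(reversed(row)) for row in g],
    "flip_vertical":   lambda g: [row[:] for row in reversed(g)],
}

# Filter: a candidate survives only if it fits every training example.
survivors = {
    name: fn
    for name, fn in candidates.items()
    if all(fn(inp) == out for inp, out in train_pairs)
}
print("surviving rules:", list(survivors))   # ['flip_vertical']

for name, fn in survivors.items():
    print(name, "->", fn(test_input))        # flip_vertical -> [[9, 0], [7, 8]]
```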

While ARC remains one of the most credible current attempts to measure AI intelligence, the Scale and CAIS initiative shows that the search for effective tests continues. Interestingly, the winning questions will not be published online, so that AIs cannot train on the exam materials.

Understanding when machines are approaching human-level reasoning carries important implications for safety and ethics. Once that threshold is crossed, an even harder question will follow: how do we assess superintelligence? That is a task we must prepare to tackle.
