ARC-AGI-2 Challenges AI Models with Human-Level Intelligence Test

The Arc Prize Foundation has launched a new benchmark, ARC-AGI-2. The test measures the general, common-sense reasoning of leading artificial intelligence models and establishes a human baseline by comparing their performance directly against that of human participants. More than 400 people took the test, scoring an average of 60% and far outperforming every AI model evaluated.

ARC-AGI-2 is an extremely difficult test for AI systems. OpenAI’s o1-pro and DeepSeek’s R1 both scored between 1% and 1.3%. Likewise, non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, achieved roughly 1%. These results show how nuanced and complex the test is, and how far AI models remain from replicating human-level reasoning.

Its predecessor, ARC-AGI-1, went unbeaten for nearly five years until December 2024, when OpenAI launched its state-of-the-art reasoning model, o3. That model was the first to reach human-level performance on ARC-AGI-1, with its low-compute variant scoring 75.7%. On ARC-AGI-2, however, the same model fares poorly, scoring only about 4% while spending roughly $200 worth of computing power per task.

To encourage further innovation, the Arc Prize Foundation has also launched the Arc Prize 2025 contest. The competition challenges developers to reach 85% accuracy on ARC-AGI-2 while spending no more than $0.42 per task. The new benchmark thus adds an explicit efficiency requirement, ruling out “brute force” solutions that rely solely on massive computing power.
