The toughest AI benchmark just got a whole lot tougher
ARC-AGI-3 is the latest version of a clever benchmark that challenges AI models to solve mini video games with no written instructions.
The flood of new AI models with increasingly advanced “reasoning” capabilities is forcing the AI industry to retire early benchmark tests and invent new ones that probe a wider range of skills.
To watch the evolution of one such test — ARC-AGI — is to witness the huge technical leaps that today’s generative-AI models have made in a few short years. Tech CEOs brag about their models’ high scores on ARC-AGI, which is widely considered one of the most distinctive and difficult AI benchmarks in use today.
Rather than testing how well a model can translate an inscription on an ancient Roman tombstone, or offer a diagnosis for a complex medical case, ARC-AGI challenges AI models to analyze abstract geometric puzzles and games without any written instructions. This forces the models to construct solutions to complex multistep problems rather than regurgitate text from their training data.
We created an in-house game studio and built 135 novel environments from scratch
— ARC Prize (@arcprize) March 25, 2026
No instructions, Core Knowledge Priors-only
In order to beat these games, AI must:
• Explore the environment
• Form hypotheses
• Execute a plan
• Learn and adapt
The just-launched version, ARC-AGI-3, is essentially a collection of mini games in which the player moves simple shapes across a pixelated game board using directional arrows. By design, the games are easy for humans to figure out after a few minutes of experimentation, but incredibly difficult for computers to solve.
One of the fascinating new features of the latest version is a replay mode that lets human observers read through AI models’ “chain of thought” transcript to see how a model breaks down the problem and attempts a solution.
Humans can play through these games on the project’s website. For now it seems humans don’t have much to worry about.
The most capable state-of-the-art models in the wild have yet to crack even a 1% score. The current ARC-AGI-3 leaderboard shows OpenAI’s GPT-5.4 in the lead at 0.3%, with Anthropic’s Opus 4.6 and Google’s Gemini 3.1 Pro tied for second place. xAI’s Grok 4.20 Reasoning model scored 0%.