The toughest AI benchmark just got a whole lot tougher

ARC-AGI-3 is the latest version of a clever benchmark that challenges AI models to solve mini video games with no written instructions.

Jon Keegan

3/26/26 3:10PM

The flood of new AI models with increasingly advanced “reasoning” capabilities is forcing the AI industry to abandon early benchmark tests and invent new ones that probe a broader range of skills.

To watch the evolution of one such test, ARC-AGI, is to witness the huge technical leaps that today’s generative-AI models have made in a few short years. Tech CEOs brag about their models’ high scores on ARC-AGI, which is widely considered one of the most distinctive and difficult AI benchmarks in use today.

Rather than testing how well a model can translate an inscription on an ancient Roman tombstone, or offer a diagnosis for a complex medical case, ARC-AGI challenges AI models to analyze abstract geometric puzzles and games without any written instructions. This forces the models to devise solutions to complex multistep problems, rather than regurgitate text from their training data.

We created an in-house game studio and built 135 novel environments from scratch

No instructions, Core Knowledge Priors-only

In order to beat these games, AI must:
• Explore the environment
• Form hypotheses
• Execute a plan
• Learn and adapt
— ARC Prize (@arcprize) March 25, 2026

ARC-AGI-3, the version that just launched, is essentially a collection of mini games that the player navigates by moving simple shapes across a pixelated game board with directional arrows. By design, the games are easy for humans to figure out after a few minutes of experimentation but incredibly difficult for computers to solve.
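
To make the explore, hypothesize, plan, and adapt loop listed above more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `ToyGrid`, `explore`, `plan`, and `solve` are hypothetical names, the board is a made-up 5x5 grid, and none of it reflects ARC-AGI-3’s actual games or interface. Unlike the real benchmark, this toy version also tells the agent where the goal is; only the meaning of the four actions has to be discovered.

```python
from collections import deque


class ToyGrid:
    """Hypothetical 5x5 board. NOT the real ARC-AGI-3 interface."""

    # Action ids are opaque on purpose: the agent is never told what they do.
    _MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

    def __init__(self, size=5, start=(0, 0), goal=(4, 3)):
        self.size, self.pos, self.goal = size, start, goal

    def step(self, action):
        """Apply an action, clamp to the board, return (position, solved)."""
        dx, dy = self._MOVES[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        return self.pos, self.pos == self.goal


def explore(env, actions, max_passes=3):
    """Explore: try each action and record the displacement it causes."""
    effects = {}  # hypothesis: action id -> (dx, dy)
    for _ in range(max_passes):
        for a in actions:
            if a in effects:
                continue
            before = env.pos
            after, _ = env.step(a)
            delta = (after[0] - before[0], after[1] - before[1])
            if delta != (0, 0):  # a blocked move teaches nothing; retry later
                effects[a] = delta
    return effects


def plan(start, goal, effects, size):
    """Plan: breadth-first search over the learned model of the actions."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        pos, path = queue.popleft()
        if pos == goal:
            return path
        for a, (dx, dy) in effects.items():
            nxt = (pos[0] + dx, pos[1] + dy)
            if 0 <= nxt[0] < size and 0 <= nxt[1] < size and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [a]))
    return None


def solve(env, actions=(0, 1, 2, 3)):
    """Execute the plan; if an action surprises us, re-explore and replan."""
    effects = explore(env, actions)
    path = plan(env.pos, env.goal, effects, env.size)
    steps = 0
    while path:
        a = path.pop(0)
        before = env.pos
        pos, done = env.step(a)
        steps += 1
        if done:
            return steps
        expected = (before[0] + effects[a][0], before[1] + effects[a][1])
        if pos != expected:  # adapt: the hypothesis was wrong
            effects = explore(env, actions)
            path = plan(env.pos, env.goal, effects, env.size)
    return steps if env.pos == env.goal else None


if __name__ == "__main__":
    print(solve(ToyGrid()))  # number of moves used after exploration
```

The sketch is far simpler than anything in the benchmark. Its point is only that the agent starts with no description of the controls and has to build a working model of them by experimenting before it can plan.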

François Chollet, the creator of ARC-AGI, told Sherwood News in an email:

“You can’t cram for the test. It requires you to explore and figure out each environment on the fly, on your own, instead of relying on extensive training data. Humans are really good at adapting to novelty, but AI systems are still fundamentally reliant on memorized templates.”

Chollet said that even after AI models have seen thousands of games, they struggle with ARC-AGI-3, since each of its games is unique.

One of the fascinating new features of the latest version is a replay mode that lets human observers read through AI models’ “chain of thought” transcript to see how a model breaks down the problem and attempts a solution.

Humans can play through these games on the project’s website. For now, it seems humans don’t have much to worry about.

The most capable state-of-the-art models in the wild haven’t even cracked 1% (out of a possible 100%). The current leaderboard for ARC-AGI-3 shows OpenAI’s GPT-5.4 in the lead at 0.3%, with Anthropic’s Opus 4.6 and Google’s Gemini 3.1 Pro tied for second place. xAI’s Grok 4.20 Reasoning model scored 0%.

Chollet says his team is already working on future versions of ARC-AGI:

“We are currently working on ARC-AGI-4 and ARC-AGI-5. We will release a new benchmark every year, each time asking the most important unsolved questions on the way to AGI. Three important topics we’re looking at for future versions are continual learning, open-endedness, and autonomous invention.”

Updated to include comments from François Chollet.

Rani Molla

5/8/26

Intel pops on reported Apple chip deal

Intel soared more than 14% on a Wall Street Journal report saying the company has reached a preliminary agreement with Apple to manufacture chips for the iPhone maker. Intel, already on a tear of late, jumped earlier this week when Bloomberg first reported the two companies were in talks. It’s still unclear which chips Intel would manufacture for Apple, which has been facing supply constraints for its iPhone as well as other products.

In any case, the deal could help Apple ease supply constraints that have hit some of its products and reduce its reliance on longtime partner TSMC, as it aims to bring more chip manufacturing stateside.

Rani Molla

5/8/26

Wedbush’s Dan Ives raises Apple price target to $400 on $15 billion AI services opportunity

Apple may not have a frontier AI model or a fully functional AI assistant, but that won’t stop the company from throwing its weight around in the “AI revolution,” according to Wedbush Securities analyst Dan Ives. That’s enough for Ives to raise his price target for Apple shares to $400 from $350.

Underpinning that jump is what Ives sees as a $15 billion annual revenue opportunity for Apple in AI services from monetizing other companies’ models by distributing them to its 2.5 billion iOS users. Ives estimates that in the coming years, roughly 20% of the world’s population will access AI through an Apple device, calling the company the “consumer hub of AI.”

That new era, Ives says, will officially kick off at Apple’s developer conference in June, where he expects Apple to “finally unveil its AI strategy.”

Rani Molla

5/7/26

Tesla’s Model Y just cleared a new federal safety bar

The National Highway Traffic Safety Administration announced today that Tesla Model Ys manufactured after November 12 were the first to pass the agency’s new advanced driver assistance system tests, which are now part of the New Car Assessment Program. According to NHTSA, Tesla tested the 2026 Model Y and submitted the results to the agency for review.

“By successfully passing these new tests, the 2026 Tesla Model Y demonstrates the lifesaving potential of driver assistance technologies and sets a high bar for the industry,” NHTSA Administrator Jonathan Morrison wrote in the press release. “We hope to see many more manufacturers develop vehicles that can meet these requirements.”

The new tests include:

• Pedestrian automatic emergency braking
• Lane-keeping assistance
• Blind spot warning
• Blind spot intervention

The milestone offers Tesla highly coveted regulatory validation, as it seeks to spur usage of its Full Self-Driving (Supervised) tech.