A screenshot from arcprize.org. (Source: https://arcprize.org/arc)

OpenAI’s ARC de Triumph

How a puzzle designed to resist memorization is reshaping AI intelligence tests. See if you can score better than an AI model.

2/28/25 3:51PM

OpenAI’s Sam Altman thinks that the upcoming GPT-5 AI model will be smarter than him. Emphasis ours:

“I don’t think I’m going to be smarter than GPT-5. And I don’t feel sad about it because I think it just means that we’ll be able to use it to do incredible things.” 

But what does it actually mean for an AI to be “smart”? It turns out this is pretty difficult to nail down, as the AI world can’t even decide what the definition of “artificial general intelligence” (AGI) is. While imperfect, the industry has embraced the use of “benchmarks” — tests designed to measure an AI model’s knowledge and reasoning ability.

Want to test how well an AI model can write code? There’s a benchmark for that: SWE-bench. High school math? AIME 2024. General knowledge across multiple subjects? MMLU.

AI companies are quick to boast about how well their new models score on different benchmarks, using the results as proof of progress. But recently, a new high score on one of the most challenging benchmarks caught the industry’s attention.

Unlike traditional benchmarks filled with multiple-choice questions, ARC-AGI challenges AI models with purely visual puzzles designed to test complex reasoning skills. For five years, no AI model could score higher than 5% on the test.

That changed when OpenAI announced on December 20, 2024, that its just-released “o3” model had solved the ARC-AGI test. This marked the first time any AI had passed the test — a huge leap over the other state-of-the-art models. 

François Chollet, an AI researcher and cofounder of new AI startup Ndea, created the ARC-AGI challenge in 2019. When asked about the significance of OpenAI’s breakthrough score on his test, Chollet called it “a major achievement.”

In an email to Sherwood News, Chollet noted that OpenAI’s previous GPT models all scored near zero on the test. “It demonstrates the fact that o3 is not limited to memorized skills and memorized knowledge, unlike the GPT series, or really unlike any prior major AI system. It is actually capable of adapting to novelty, at least in the very simple context of ARC-AGI tasks,” Chollet said.

Many of the previous testing tools failed to demonstrate what Chollet considers actual intelligence: “the ability to adapt on the fly to new situations and new problems. Rather they demonstrated memorized skill,” Chollet said. 

Input > Output

So what does this test look like? It’s a series of tasks, each a visual puzzle made up of colored tiles on a grid, with the computer reading each color as a different number.

Each task shows the test-taker a handful of problem and solution pairs — the input and the output. The test-taker has to study the sample pairs and deduce the rule that turns each input into its output. Then they’re given a new puzzle that shows only the input and must apply that same rule to produce the answer.
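To make that concrete, here is a minimal sketch in Python of how such a task can be represented. The task and the rule below are made up for illustration, not taken from the actual ARC-AGI set: each grid is just a small matrix of integers, one per color, and a candidate rule is checked against the demonstration pairs before being applied to the test input.

```python
# A hypothetical ARC-style task: grids are lists of lists of ints,
# where each integer stands for one of the puzzle's colors.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": {"input": [[3, 0], [0, 3]]},
}

def candidate_rule(grid):
    """Rule deduced from the examples: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Check the candidate rule against every demonstration pair...
assert all(candidate_rule(pair["input"]) == pair["output"] for pair in task["train"])

# ...then apply it to the held-out test input, as a human solver would.
print(candidate_rule(task["test"]["input"]))  # -> [[0, 3], [3, 0]]
```

The difficulty of the real test comes from the fact that every task requires a different, previously unseen rule, so the “deduce the rule” step can’t simply be memorized in advance.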

A screenshot from arcprize.org. (Source: https://arcprize.org/arc)

Chollet says the test was designed not to require any special training, only innate “core knowledge”: object persistence, basic geometry, numbers and counting, and the notion that objects can be animate or inanimate.

“Most of the puzzles are elementary, something a young child could figure out,” Chollet explained. “There’s just one thing: it’s designed to resist memorization. Each task is unique. You must figure out each task on the fly.”


OpenAI shared its test results with Chollet, who has verified them. The company has not yet released the full results for the community to assess, and the numbers we can see require some unpacking.

According to Chollet, the test was run in several configurations. One used a “high efficiency, low cost” computing process. This cheaper, low-compute mode scored an impressive 76%, beating all previous records but falling short of the 85% passing score. On average, it took 1.3 minutes and about $20 worth of computing power to solve each puzzle.

But when OpenAI cranked it up and used the “low efficiency, high cost” process, it scored a winning 88% on the test. According to Greg Kamradt, president of the nonprofit ARC Prize Foundation — which develops and funds benchmarks like ARC-AGI — human performance on the test is about 85%. To achieve such a high score, OpenAI’s model had to work much harder, at a vastly higher cost.

On average, the most computing-intensive test took 13.8 minutes of “thinking” for each puzzle. That’s a lot of time for some expensive GPUs, like Nvidia’s popular H100 chip, and all that thinking isn’t cheap.

The cost to solve each puzzle was about $3,400, according to estimates from Chollet. 
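Putting the two configurations side by side, here is a rough back-of-the-envelope comparison using only the per-puzzle figures reported above, which are estimates rather than official pricing:

```python
# Reported per-puzzle figures for o3 on ARC-AGI (estimates cited above).
low_compute = {"score": 0.76, "minutes": 1.3, "cost_usd": 20}
high_compute = {"score": 0.88, "minutes": 13.8, "cost_usd": 3_400}

print(high_compute["cost_usd"] / low_compute["cost_usd"])          # 170.0 -> ~170x the cost per puzzle
print(round(high_compute["minutes"] / low_compute["minutes"], 1))  # 10.6  -> ~10.6x the "thinking" time
print(round(high_compute["score"] - low_compute["score"], 2))      # 0.12  -> 12 more percentage points
```

In other words, roughly 170 times the spending bought about 12 additional percentage points, which is the trade-off behind the cost concerns discussed next.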

In January 2025, shortly after the release of DeepSeek’s R1 “reasoning” model, the foundation announced that the new model performed about as well as OpenAI’s “o1-preview” model, scoring roughly 20% on the test while costing only $0.05 per puzzle on average.

Yesterday, OpenAI released GPT-4.5, the last “non-reasoning” model the company will release, and you might see why. The new model achieved a score of only 10% (for $0.29 per task). 

As the industry grapples with a possible plateau of the AI “scaling law” that fueled the first wave of the generative-AI boom, a consensus is emerging that multistep reasoning like that found in o3 and DeepSeek R1 is going to be an important part of future performance gains. But as these test results show, longer, more intensive AI computing also drastically increases the costs — both fiscal and environmental. 

Big Tech is planning on huge demand for this type of energy-thirsty computing, and intends to spend $315 billion this year alone on AI data centers and computing infrastructure. 

But… is this AGI?

In the 2019 research paper in which Chollet proposed the ARC dataset, he speculated about the capabilities of an advanced AI system that could one day solve the test. 

“We posit that the existence of a human-level ARC solver would represent the ability to program an AI from demonstrations alone (only requiring a handful of demonstrations to specify a complex task) to do a wide range of human-relatable tasks of a kind that would normally require human-level, human-like fluid intelligence,” Chollet wrote, emphasis ours.

While it’s not clear that any AI has reached this level of intelligence yet, it is clear that new benchmarks will be needed to test the new models emerging from the fast-moving AI development pipeline. The ARC Prize Foundation is currently funding development of ARC-AGI-2, which will be released in March, as well as ARC-AGI-3, which will look more like simple animated 8-bit video games.

The ARC-AGI-2 benchmark was designed specifically to be difficult for these new reasoning models, but still easy for humans to solve. Today, humans appear to have the advantage, but the rapid pace of progress suggests that won’t last forever.

OpenAI did not return a request for comment. 

More Tech


Report: Microsoft adds Anthropic alongside OpenAI in Office 365, citing better performance

In a move that could test its fraught $13 billion partnership, Microsoft is moving away from relying solely on OpenAI to power its AI features in Office 365 and will now also include Anthropic’s Claude Sonnet 4 model, according to a report from The Information.

The move is a tectonic shift that boosts Anthropic’s standing, heightens risks for OpenAI, and has huge ramifications for the balance of power in the fast-moving AI field.

Per the report, Microsoft executives found that Anthropic’s AI outperformed OpenAI’s on tasks involving spreadsheets and generating PowerPoint slide decks, both crucial parts of Microsoft’s Office 365 productivity suite.

Microsoft will have to pay the competition to provide the services — Amazon Web Services currently hosts Anthropic’s models while Microsoft’s Azure cloud service does not, The Information reported.

OpenAI is also reportedly working on its own productivity suite of apps.


Apple announces extra slim iPhone Air, iPhone Pro with longer battery life, updated AirPods Pro 3 with live language translation, and refreshed Apple Watch line

At today’s “Awe Dropping” Apple event, the company announced its yearly refresh of the iPhone lineup. The new iPhone 17, iPhone 17 Pro, and iPhone 17 Pro Max were joined by a brand-new addition: the iPhone Air, a superthin model with tougher glass and faster processors.

Apple shares dipped on news of the product releases and are down about 1.4% on the day in afternoon trading.

The company also announced an updated Apple Watch line — Series 11, SE3, and Ultra 3 — with new features like 5G, high blood pressure detection, 24-hour battery life, and satellite communication. 

Apple’s iPhone 17 (Photo: Apple)

Here’s a breakdown of the new products Apple announced:

  • The ultrathin iPhone Air was described by Apple as “a paradox you have to hold to believe.” The sleek 5.6-millimeter-thin iPhone features a crack- and scratch-resistant front and back and “MacBook Pro levels of compute,” which you can pair with a weird $59 cross-body strap. It starts at $999.

  • The iPhone 17 has a faster A19 chip, an improved smart selfie camera, and a higher-resolution screen. It starts at $799.

  • The iPhone 17 Pro has a new design, ever-faster A19 Pro chip, a tougher ceramic shield on the front and back, better cameras, and a bigger battery that gets an extra 10 hours of video playback compared to its predecessor. It costs $100 more than the previous generation, but the minimum storage has doubled to 256 gigabytes. It starts at $1,099.

  • The iPhone 17 Pro Max starts at $1,199.

  • The AirPods Pro 3 have AI-powered live translation, a new heart rate sensor, eight hours of battery life, and improved active noise cancellation. The new AirPods can also track workouts, and Apple says they are built to fit more people’s ears with a new design and foam ear tips. They start at $249.

  • The Apple Watch Series 11 has 5G, a new high blood pressure detection feature, improved sleep tracking, a more scratch-resistant face, and 24 hours of battery life.

  • The entry-level Apple Watch SE 3 gets 5G, new health-tracking features, and an always-on display. It starts at $249.

  • The chunky Apple Watch Ultra 3 has an impressive 42-hour battery life, satellite communications for emergencies, and a brighter and bigger display. It starts at $799.


Nebius soars after signing a 5-year deal with Microsoft to supply nearly $20 billion worth of AI computing power

Artificial intelligence infrastructure group Nebius jumped more than 50% in early trading on Tuesday after the company announced, following Monday’s market close, a major deal to supply computing power for Microsoft’s AI operations.

Under the agreement, Nebius — which rose from the ashes of Russian tech giant Yandex — will provide Microsoft “access to dedicated GPU infrastructure capacity in tranches at its new data center in Vineland, New Jersey over a five-year term.” The New Jersey data center has a capacity of 300 megawatts. The total contract value through 2031 is $17.4 billion, though, if further capacity is required, the contract value could rise to $19.4 billion.

The deal represents a sizable portion of Microsoft’s proposed annual capital expenditure on AI, which is expected to reach $120 billion by the end of fiscal 2026.

Nebius and competitor CoreWeave are both on the short list of startups that Nvidia has invested in. Nvidia’s small stake in the former is now worth about $120 million.

