Tech
Jon Keegan

If AI models can ace every test, it’s actually not a good thing

AI companies are eager to show how much “smarter” and more capable their latest large language models are. To highlight these improvements, they cite scores on widely used AI benchmarks such as HellaSwag or MMLU as evidence of how well the tools can write code or solve problems.

But these benchmarks have some serious flaws. Many of the most-cited ones were created to test much simpler AI systems, years before chatbots like ChatGPT hit the scene. Many were built by scraping amateur websites and lacked expert oversight. And because these benchmarks have been widely available on the internet for years, AI models have likely seen them as part of their training data.

The Financial Times reports that AI companies are now racing to build a new class of benchmarks, as today’s AI models all score at or above 90% on these tests.

As AI companies start to pivot to a model of AI “agents” that can control your computer and use multistep reasoning, new benchmarks are being created that can test these capabilities in a more meaningful way.

But a lack of regulatory oversight or industry standards means the public must rely on companies like Meta and OpenAI to test their own products for safety, and it leaves no standard way to compare results between them.

More Tech

🚀 $100B

Alphabet’s 2015 investment in SpaceX is about to pay off handsomely with the company’s hotly anticipated IPO later this year, which is expected to be the largest in history.

Bloomberg reports that according to new financial filings, Alphabet’s investment could be worth up to $100 billion.

In 2015, Google and Fidelity together invested $1 billion in a round that valued SpaceX at $10 billion. At the end of 2025, Google owned just over 6% of SpaceX, per Bloomberg’s reporting on the more recent filings. That stake has likely been diluted due to SpaceX’s merger with xAI.

$1

Barclays says autonomous couriers — think sidewalk robots and drones — could push delivery costs down to as little as $1 per order, from between $5 and $7 today and closer to $9 for traditional deliveries in high-labor-cost markets. If robots save $4 on every delivery, and enough companies start using them, the food delivery industry, including companies like DoorDash and Uber, could end up with $16 billion in extra profit every year, according to Barclays.
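A quick back-of-envelope check shows the scale those numbers imply. The sketch below assumes the $16 billion comes entirely from the $4-per-delivery saving, which is a simplification of ours and not something Barclays states; the function name is hypothetical.

```python
# Back-of-envelope check of the Barclays figures quoted above.
# Assumption (ours): all of the extra profit comes from the per-delivery saving.
SAVINGS_PER_DELIVERY_USD = 4        # saving per robot delivery, from the article
EXTRA_PROFIT_PER_YEAR_USD = 16e9    # industry-wide extra profit, from the article


def implied_robot_deliveries(extra_profit: float, savings_per_delivery: float) -> float:
    """Automated deliveries per year needed to reach the profit figure."""
    return extra_profit / savings_per_delivery


deliveries = implied_robot_deliveries(EXTRA_PROFIT_PER_YEAR_USD, SAVINGS_PER_DELIVERY_USD)
print(f"{deliveries / 1e9:.1f} billion automated deliveries per year")
```

Under that assumption, the profit figure works out to roughly 4 billion robot or drone deliveries a year.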

The catch: we’re nowhere near that world yet. Robots and drones handle less than 1% of deliveries today. Even by 2035, Barclays only sees penetration hitting around 10%.

Google’s Wing and Amazon have also been trying to crack last-mile product delivery — a reminder that this is part of a broader race to automate the most expensive leg of e-commerce.

$10B

Uber has long had an asset-light business model: it provided the ride-hailing platform, and its contract workers brought their own vehicles. That’s changing as Uber positions itself at the center of the robotaxi era.

The Financial Times estimates that Uber has committed more than $10 billion to buying robotaxi fleets ($7.5 billion) and investing in the companies that make them ($2.5 billion). That includes yesterday’s announcement that it is expanding its investment in Lucid, a deal worth about $2 billion, with plans to buy 35,000 vehicles.

This shift pits Uber against industry leaders like Google’s Waymo and Tesla, whose models involve company-owned vehicles running on proprietary platforms. While these autonomous fleets eliminate the need for drivers, they introduce new capital-intensive requirements for charging, cleaning, storage, and repair.

Sherwood Media, LLC produces fresh and unique perspectives on topical financial news and is a fully owned subsidiary of Robinhood Markets, Inc., and any views expressed here do not necessarily reflect the views of any other Robinhood affiliate, including Robinhood Markets, Inc., Robinhood Financial LLC, Robinhood Securities, LLC, Robinhood Crypto, LLC, Robinhood Derivatives, LLC, or Robinhood Money, LLC. Futures and event contracts are offered through Robinhood Derivatives, LLC.