Jon Keegan, 11/11/24

If AI models can ace every test, it’s actually not a good thing

AI companies are eager to show how much “smarter” and more capable their latest large language models are. To highlight these improvements, they point to scores on widely used AI benchmarks like HellaSwag or MMLU to show off how well the tools can write code or solve problems.

But these benchmarks have some serious flaws. Many of the most-cited ones were created to test much simpler AI systems, years before chatbots like ChatGPT hit the scene. Many were built by scraping amateur websites and lacked expert oversight. And because these benchmarks have been widely available on the internet for years, AI models have likely seen them as part of their training data.

The Financial Times reports that AI companies are now racing to build a new class of benchmarks, as today’s AI models all score at or above 90% on these tests.

As AI companies start to pivot toward AI “agents” that can control your computer and use multistep reasoning, new benchmarks are being created that can test these capabilities in a more meaningful way.

But a lack of regulatory oversight or industry standards means the public must rely on companies like Meta and OpenAI to test their own products for safety, and it leaves no standard way to compare results across companies.

AI groups rush to redesign model testing and create new benchmarks

