Tech
Jon Keegan

If AI models can ace every test, it’s actually not a good thing

AI companies are eager to show how much “smarter” and more capable their latest large language models are. To highlight these improvements, they point to scores on widely used AI benchmarks such as HellaSwag or MMLU to show off how well the tools can write code or solve problems.

But these benchmarks have some serious flaws. Many of the most-cited ones were created to test much simpler AI systems years before chatbots like ChatGPT hit the scene. Many were built by scraping amateur websites and lacked expert oversight. And because these benchmarks have been widely available on the internet for years, AI models have likely seen them as part of their training data.

The Financial Times reports that AI companies are now racing to build a new class of benchmarks, as today’s AI models all score at or above 90% on these tests.

As AI companies start to pivot to a model of AI “agents” that can control your computer and use multistep reasoning, new benchmarks are being created that can test these capabilities in a more meaningful way.

But a lack of regulatory oversight or industry standards means the public must rely on companies like Meta and OpenAI to test their own products for safety, and it leaves no standard way to compare results across them.

More Tech

California AG launches probe into xAI and Grok over sexualized deepfakes of women and children

The California attorney general just opened an investigation into xAI, Elon Musk’s AI startup, over chatbot Grok’s apparent role in generating nonconsensual sexual images of women and children. The probe centers on reports that Grok has been used to facilitate the creation of sexually explicit images without consent, many of which have circulated on X.

“The avalanche of reports detailing the non-consensual, sexually explicit material that xAI has produced and posted online in recent weeks is shocking,” Attorney General Rob Bonta wrote in a press release. “As the top law enforcement official of California tasked with protecting our residents, I am deeply concerned with this development in AI and will use all the tools at my disposal to keep California’s residents safe.”

California’s move follows growing scrutiny from US lawmakers and the UK government over AI-generated sexual content and deepfakes.

xAI and Tesla CEO Musk earlier today wrote that he was “not aware of any naked underage images generated by Grok. Literally zero.”

Grok is currently No. 5 among free apps on Apple’s App Store.

Report: Microsoft on track to spend $500 million per year on Anthropic AI

Last fall, Microsoft and OpenAI’s $13 billion partnership seemed to finally be on solid ground.

OpenAI’s restructuring was completed on time, and the companies hammered out an updated agreement that secured OpenAI’s status as Microsoft’s AI provider of choice, but also allowed for Microsoft to work with other companies.

Now Microsoft is doing exactly that. Microsoft has been increasing its spending on Anthropic’s AI, and is on track to spend $500 million per year on the startup’s services, according to a new report from The Information.

The increasingly cozy relationship between the companies includes the rare move of Microsoft offering its salespeople incentives that allow Anthropic sales to count toward their quotas, per the report. Microsoft invested $5 billion in Anthropic as part of a big deal in November that included Nvidia.

Microsoft has also been using Anthropic’s AI to power more and more of its own products, such as GitHub Copilot and 365 Copilot.

Report: Apple staggers Siri AI rollout, with key features pushed to summer

Thanks to Apple’s new partnership with Google, the Gemini-backed version of Siri should begin rolling out this spring, but several key features Apple previewed in 2024 may not come until summer, The Information reports.

The new Siri is soon expected to answer general questions with ChatGPT-like answers, rather than quoting directly from websites or not answering at all. But more personalized, proactive features, such as remembering information from past conversations to suggest you leave earlier for a planned trip to beat traffic, may not be unveiled until June at the company’s Worldwide Developers Conference.

The report also clarifies that while Apple’s partnership with Microsoft-backed OpenAI, wherein users could summon ChatGPT for complex questions, isn’t changing, the Google deal might reduce the need for people to do so because Siri will likely be able to answer those questions itself. The Information notes, citing a person familiar with the deal, that the ChatGPT option hadn’t driven much traffic to OpenAI before.

Sherwood Media, LLC produces fresh and unique perspectives on topical financial news and is a fully owned subsidiary of Robinhood Markets, Inc., and any views expressed here do not necessarily reflect the views of any other Robinhood affiliate, including Robinhood Markets, Inc., Robinhood Financial LLC, Robinhood Securities, LLC, Robinhood Crypto, LLC, or Robinhood Money, LLC.