
Meta scrambling to defend its AI after Llama 4 benchmark bungle

This weekend, Meta surprised everyone and released two flavors (“Maverick” medium and “Scout” small) of its highly anticipated Llama 4 AI model. Llama 4’s release is a big deal, as the company has been hyping it up as the key to its AI plans in the coming year.

When a major new model drops, people do two things: check to see how the model scored on major benchmarks, and load up the model and kick the tires.

Llama 4 posted some eye-popping results on Chatbot Arena, a popular human-powered benchmark that’s a sort of blind taste test for AI models with side-by-side results. But after looking at the fine print, some in the community cried foul, as Meta achieved the higher score using an “experimental chat version” of Llama 4 that was not available to the public.

A footnote to a chart that highlighted Llama 4’s standout score read “LMArena testing was conducted using Llama 4 Maverick optimized for conversationality.”

In response to the controversy, LMArena (which runs the Chatbot Arena benchmark) updated its guidelines for testing:

“Meta’s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.”

This led to some unfounded accusations that Meta had trained its model on test datasets — akin to giving a kid the answers to a quiz before having them take the test.

To quell the firestorm of questions surrounding the model’s release, Meta’s head of generative AI, Ahmad Al-Dahle, denied the claims in a post on X yesterday.

The release was also unusual for what was missing: the extra-large version of the model, named “Behemoth.” Meta said the model was still being trained, but boasted about its performance nonetheless.

“Llama 4 Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks. Llama 4 Behemoth is still training, and we’re excited to share more details about it even while it’s still in flight.”

Meta did not immediately respond to a request for comment.

More Tech


Report: Google DeepMind builds “strike team” to catch up to Anthropic models

Anthropic’s recent momentum, powered by the success of its popular Claude Code tool, is turning up the heat on its AI competitors, including not only its startup peer OpenAI but also established Big Tech giants like Google.

The Information reports that within Google DeepMind, a “strike team” has been assembled to make a serious push to improve Gemini’s coding capabilities. According to the report, leaders within Google, including cofounder Sergey Brin, are sounding the alarm after determining that Anthropic’s Claude has superior coding skills. The new team’s goal is to create an AI system that can improve itself.

“To win the final sprint, we must urgently bridge the gap in agentic execution and turn our models into primary developers,” Brin wrote in a recent memo to DeepMind staff.

$0

Tesla’s federal tax bill last year was once again $0, Reuters reports. While past losses and green energy credits helped shrink the bill, Reuters found that Tesla also leaned on a classic corporate maneuver: offshore profit-shifting. By routing intellectual property rights through paper-only subsidiaries in the Netherlands and Singapore, Tesla effectively parked $18 billion in profits overseas between 2023 and early 2025. The entirely legal setup saved Tesla an estimated $400 million in US taxes. Not bad for a company whose CEO is not a fan of “shady” tax loopholes.


Report: NSA is currently using Anthropic’s unreleased Mythos model

According to the Pentagon, Anthropic’s AI tools are a national security supply chain risk, and have been banned for defense applications.

But a new report says the National Security Agency, which operates as a part of the Pentagon, is currently busy using Anthropic’s new, unreleased AI model, Mythos.

Axios reports that Mythos’ reputed advanced offensive cyber capabilities have compelled the NSA to begin using it, despite the public blacklisting from the Pentagon, which Anthropic is suing the US government over.

Anthropic has granted access to a small number of trusted partners so they can test the model and prepare for the surge of vulnerabilities it is expected to uncover. UK intelligence agencies have also reportedly gained access to Mythos.

Anthropic CEO Dario Amodei reportedly visited the White House last week to try to resolve the dispute over allowing wider use of the company’s technology in the federal government.

Sherwood Media, LLC produces fresh and unique perspectives on topical financial news and is a fully owned subsidiary of Robinhood Markets, Inc., and any views expressed here do not necessarily reflect the views of any other Robinhood affiliate, including Robinhood Markets, Inc., Robinhood Financial LLC, Robinhood Securities, LLC, Robinhood Crypto, LLC, Robinhood Derivatives, LLC, or Robinhood Money, LLC. Futures and event contracts are offered through Robinhood Derivatives, LLC.