Meta scrambling to defend its AI after Llama 4 benchmark bungle

This weekend, Meta surprised everyone and released two flavors (“Maverick” medium and “Scout” small) of its highly anticipated Llama 4 AI model. Llama 4’s release is a big deal, as the company has been hyping it up as the key to its AI plans in the coming year.

When a major new model drops, people do two things: check how it scored on major benchmarks, and load it up to kick the tires.

Llama 4 posted eye-popping results on Chatbot Arena, a popular human-powered benchmark that works as a sort of blind taste test for AI models by showing their answers side by side. But after reading the fine print, some in the community cried foul: Meta had achieved the high score using an “experimental chat version” of Llama 4 that was not available to the public.

A footnote to a chart that highlighted Llama 4’s standout score read “LMArena testing was conducted using Llama 4 Maverick optimized for conversationality.”

In response to the controversy, LMArena (which runs the Chatbot Arena benchmark) updated its guidelines for testing:

“Meta’s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.”

This led to some unfounded accusations that Meta had trained its model on test datasets — akin to giving a kid the answers to a quiz before having them take the test.

To quell the firestorm of questions surrounding the model’s release, Meta’s head of generative AI, Ahmad Al-Dahle, denied the claims in a post on X yesterday.

The release was also notable for what was missing: the extra-large version of the model, named “Behemoth.” Meta said the model was still being trained but boasted about its performance nonetheless.

“Llama 4 Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks. Llama 4 Behemoth is still training, and we’re excited to share more details about it even while it’s still in flight.”

Meta did not immediately respond to a request for comment.

More Tech

FT: Meta considering “tens of billions” in new capital to fund AI

Just days after Google announced a monster $85 billion upsized equity raise, the extremely profitable Meta is seeking to sell “tens of billions of dollars” in stock, according to a new report from the Financial Times.

Meta plans to spend between $125 billion and $145 billion on AI capital expenditure this year alone.

Shares dropped more than 5% on the news.

FT: Anthropic staff helping the NSA use Mythos for offensive cyberattacks

Anthropic’s Mythos AI model was deemed too dangerous to release to the public, with the company citing its ability to orchestrate novel cyberattacks.

And that’s just what the National Security Agency is doing, with the help of Anthropic staff embedded at the agency, according to a report from the Financial Times.

Only a small number of companies and US allies have been given access to the advanced model, which means America’s adversaries have not had the chance to shore up their defenses against the AI model’s new offensive capabilities.

The arrangement is especially unusual as the Pentagon has deemed Anthropic’s AI a national security supply chain risk — effectively blacklisting it for defense work — in response to the company’s refusal to allow its technology to be used for any legal application, which could include autonomous killing or mass surveillance. Anthropic is currently suing the US government to fight the determination.

Longtime Tesla bear JPMorgan upgraded Tesla and raised its price target to $475 from $145

For more than a decade, JPMorgan was Wall Street’s most stubborn Tesla skeptic, anchored by auto analyst Ryan Brinkman’s strict focus on traditional car fundamentals and near-term delivery numbers.

But JPM recently handed coverage of the stock to a new analyst, Rajat Gupta, who is throwing that playbook out the window. In a note Friday, the firm upgraded Tesla to neutral from underweight and raised its price target 228% to $475 from $145. (The analyst consensus on FactSet is $403.) Rather than focusing on the company’s struggling vehicle business, the new analyst is leaning into Tesla’s vision of the future, modeling its physical AI and robotaxi fleets all the way out to 2040.

Here are the main reasons for the capitulation:

  • Looking past the car lot: Gupta argues that Tesla is at the forefront of physical AI, entering “uncharted TAMs” (total addressable markets), and therefore deserves the benefit of the doubt to be valued on long-term earnings potential rather than near-term speed bumps.

  • Unmatched vertical integration: Tesla’s control over everything from battery cells to custom silicon gives it a massive moat. JPM notes this “starting point advantage is unmatched at an industrial level scale” and “still somewhat under-appreciated and misunderstood.”

  • The AWS flywheel effect: Deploying Optimus robots inside its own factories “should not only lower COGS for the base automotive business, but more importantly, help validate the product at an industrial scale.” Gupta called it “a classic flywheel effect, somewhat analogous to AWS and Kiva at AMZN.”

For Tesla bulls who have argued for years that this is an AI company and not a carmaker, JPM’s sudden $3.9 trillion valuation model is the ultimate validation.

Anthropic ponders self-improving AI

Anthropic says Claude already writes 80% of its code. A new post asks what happens when the models can improve themselves — and whether anyone could stop them.

Sherwood Media, LLC and Chartr Limited produce fresh and unique perspectives on topical financial news and are fully owned subsidiaries of Robinhood Markets, Inc., and any views expressed here do not necessarily reflect the views of any other Robinhood affiliate, including Robinhood Markets, Inc., Robinhood Financial LLC, Robinhood Securities, LLC, Robinhood Crypto, LLC, Robinhood Money, LLC, Robinhood U.K. Ltd, Robinhood Derivatives, LLC, Robinhood Gold, LLC, Robinhood Asset Management, LLC, Robinhood Credit, Inc., Robinhood Ventures DE, LLC and, where applicable, its managed investment vehicles.