Tech
Rani Molla

Meta’s not telling where it got its AI training data

Today Meta unleashed its ChatGPT competitor, Meta AI, across its apps and as a standalone app. The company boasts that it is running on its latest, greatest AI model, Llama 3, which was trained on "data of the highest quality": a dataset seven times larger than Llama 2's, with four times more code!

What is that training data? There the company is less loquacious.

Meta said the 15 trillion tokens on which it was trained came from "publicly available sources." Which sources? Meta told The Verge's Alex Heath that they didn't include Meta user data, but didn't give much more in the way of specifics.

It did mention that it includes AI-generated data, or synthetic data: “we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3.” There are plenty of known issues with synthetic or AI-created data, foremost of which is that it can exacerbate existing issues with AI, because it’s liable to spit out a more concentrated version of any garbage it is ingesting.

AI companies are turning to such data because there’s not enough good, public data on the entire internet to train their increasingly greedy AI models. (Meta had reportedly floated buying a publisher like Simon & Schuster to satisfy its insatiable data needs.)

Meta, of course, isn't the only company that's tight-lipped about where its AI data is coming from. In a now-infamous interview with the WSJ's Joanna Stern, OpenAI's chief technology officer Mira Murati was unable to answer questions about what Sora, OpenAI's video-generating app, was trained on. YouTube? Facebook? Instagram? She said she wasn't sure.



Report: SpaceX posted $18.5 billion in revenue and a $5 billion loss last year

All eyes are on SpaceX as it prepares for a blockbuster IPO as soon as this summer, and everyone is eager to get a look at the company's official numbers for the first time.

The Information is reporting that last year, SpaceX posted $18.5 billion in revenue with a $5 billion loss.

According to the report, the numbers reflect the combined finances of SpaceX and xAI, which it acquired in February.

SpaceX's successful space launch and satellite business may have been dragged down by xAI's massive data center spending after the acquisition. Earlier this year, Bloomberg reported that xAI had burned through $8 billion in the first nine months of 2025.



Report: Amazon hopes its Project Houdini modular data center plan is the trick to speed up construction

Amazon is looking for a magic trick that can help it get past data center construction bottlenecks so it can work through the $244 billion worth of cloud computing backlogs it wants to deliver.

It may have just pulled a rabbit out of its hat. (I know, groan.)

Business Insider is reporting that Amazon’s Project Houdini seeks to slash labor costs and installation time by building modular “data halls” — the rows of racks of servers that make up the heart of data centers — in factories, and then shipping them fully assembled on trailers to data center sites.

According to the report, the modular plan would save weeks of construction time and tens of thousands of hours of labor costs.

This week in Amazon’s letter to shareholders, CEO Andy Jassy wrote that the company is planning $200 billion in capital expenditure this year, and that it is embracing its tradition of taking big bets on experiments like Project Houdini:

“You need to invent and experiment like crazy. Many of these experiments will fail, and it might feel like you’re getting nowhere. But, your culture must possess the tenacity to keep at it.”



Creator of popular, mysterious “HappyHorse” text-to-video model is Alibaba

AI benchmark leaderboards are often where mysterious new models make their debut, stoking speculation about the unnamed companies behind them.

That was the case with an impressive new text-to-video model named HappyHorse-1.0 that shot to the top of public leaderboards. CNBC reports that Chinese tech giant Alibaba has confirmed that it is the owner of the new model.

HappyHorse beat out the popular Seedance model from rival ByteDance in blind human evaluations to claim the top spot on the Artificial Analysis text-to-video leaderboard.

While OpenAI has announced it is shuttering its text-to-video Sora app, the category continues to see intense competition as a flurry of video models improve with more realistic physics and cinematic effects.



OpenAI: Our new AI tool is too dangerous to release, too!

This week, Anthropic warned that it had developed a new model that posed too great a cybersecurity risk to be released to the public.

According to a new report, OpenAI is saying similar things about a new cybersecurity tool it is working on (separate from its rumored forthcoming Spud model).

Axios wrote that OpenAI is allowing a small group of partners to test its new AI tool, which has “advanced cybersecurity capabilities.”

The realization that we have arrived at an era of powerful new AI models that could overwhelm current cybersecurity defenses is spooking investors, with cybersecurity stocks like Cloudflare, Zscaler, CrowdStrike, and Palo Alto Networks all down sharply this morning.



OpenAI’s Stargate shrinks further as UK data center “paused”

OpenAI’s ambitious Stargate global data center project just got smaller.

First announced at the White House alongside President Trump at the start of his second term, the OpenAI partnership with Oracle and SoftBank sought to build massive data centers around the world, including sites in the UAE, the UK, and Norway.

Bloomberg reports that the company is “pausing” the Stargate UK project, citing high energy costs and regulatory obstacles.

Last month, the company and its partner Oracle scrapped its planned expansion of the Stargate I data center site in Abilene, Texas.

In a statement to Bloomberg, the company said:

“AI compute is foundational to that goal — we continue to explore Stargate UK and will move forward when the right conditions such as regulation and the cost of energy enable long-term infrastructure investment.”

Stargate UK was announced in September, including a partnership with Nvidia and Nscale that would scale up to 31,000 GPUs.



Sherwood Media, LLC produces fresh and unique perspectives on topical financial news and is a fully owned subsidiary of Robinhood Markets, Inc., and any views expressed here do not necessarily reflect the views of any other Robinhood affiliate, including Robinhood Markets, Inc., Robinhood Financial LLC, Robinhood Securities, LLC, Robinhood Crypto, LLC, Robinhood Derivatives, LLC, or Robinhood Money, LLC. Futures and event contracts are offered through Robinhood Derivatives, LLC.