tech
Rani Molla

Meta’s not telling where it got its AI training data

Today Meta unleashed its ChatGPT competitor, Meta AI, across its apps and as a standalone site. The company boasts that it’s running on its latest, greatest AI model, Llama 3, which was trained on “data of the highest quality”! On a dataset seven times larger than Llama 2’s! With four times more code!

What is that training data? There the company is less loquacious.

Meta said the 15 trillion tokens on which Llama 3 was trained came from “publicly available sources.” Which sources? Meta told The Verge’s Alex Heath that the training data didn’t include Meta user data, but didn’t give much more in the way of specifics.

It did mention that the data includes AI-generated, or synthetic, data: “we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3.” Synthetic data comes with plenty of known problems, foremost of which is that it can exacerbate a model’s existing flaws, because the model is liable to spit out a more concentrated version of whatever garbage it’s ingesting.

AI companies are turning to such data because there’s not enough good, public data on the entire internet to train their increasingly greedy AI models. (Meta had reportedly floated buying a publisher like Simon & Schuster to satisfy its insatiable data needs.)

Meta, of course, isn’t the only company that’s tight-lipped about where its AI training data comes from. In a now-infamous interview with the WSJ’s Joanna Stern, OpenAI chief technology officer Mira Murati was unable to answer questions about what Sora, OpenAI’s video-generating app, was trained on. YouTube? Facebook? Instagram? She said she wasn’t sure.

tech

Apple cuts sales jobs in rare layoff

Apple is cutting “dozens” of roles from its sales team in a rare layoff, according to a report from Bloomberg. The reductions are aimed at streamlining the company’s sales to businesses, schools, and government accounts, per the report.

Compared with its tech peers, Apple rarely resorts to layoffs, which makes the reduction noteworthy.

An Apple spokesperson told Bloomberg: “To connect with even more customers, we are making some changes in our sales team that affect a small number of roles,” adding that affected employees will be able to apply for new roles in the company.

tech

Anthropic releases Claude Opus 4.5 as AI war heats up

The past few weeks have seen new, impressive AI models debut from OpenAI and Google. Today it’s Anthropic’s turn to flex, as it releases Claude Opus 4.5, the latest iteration of its flagship AI model.

Anthropic’s Claude model is widely considered to be among the best at coding, and this model helps the company stay at the head of the pack.

Benchmarks released by Anthropic show Opus 4.5 besting both GPT-5.1 and Gemini 3, with an all-time high score of 80% on the widely used SWE-bench coding benchmark. It also posted high scores on benchmarks measuring computer use and on the notoriously challenging ARC-AGI-2 visual problem-solving test, though apparently it can’t run a vending machine as profitably as Google’s Gemini 3 can.

AI coding is one of the few bright spots as companies seek profitable enterprise applications for AI that actually improve productivity. Anthropic’s success with enterprise customers has helped push its valuation to nearly $350 billion.

tech

Amazon plans to invest up to $50 billion in “AI and supercomputing infrastructure” for the US government

Amazon said it will invest up to $50 billion to build out its AI computing infrastructure for the US government.

Amazon will help build up to 1.3 gigawatts of dedicated AI and high-performance computing infrastructure on its AWS cloud platform, according to a press release announcing the plans.

The project, which will include building new data centers, is set to break ground in 2026.

Amazon AWS CEO Matt Garman said:

“We’re giving agencies expanded access to advanced AI capabilities that will enable them to accelerate critical missions from cybersecurity to drug discovery. This investment removes the technology barriers that have held government back and further positions America to lead in the AI era.”

The new computing capacity will be available to agencies through AWS’s government products: AWS Top Secret, AWS Secret, and GovCloud Regions.

tech

Amazon now has 900 data centers spread across 50 countries, report says

The exact size and shape of Amazon’s global network of AWS data centers have always been a closely guarded secret. A new report from Bloomberg and SourceMaterial sheds some light on AWS’s global reach.

Based on internal documents seen by Bloomberg, Amazon’s cloud operations include more than 900 data centers spread across 50 countries.

Amazon owns the majority of its data centers, but contracts with at least 180 different colocation entities, according to the report.
