Meta’s not telling where it got its AI training data
Today Meta unleashed its ChatGPT competitor, Meta AI, across its apps and as a standalone. The company boasts that it is running on its latest, greatest AI model, Llama 3, which was trained on "data of the highest quality"! A dataset seven times larger than Llama 2's! And with four times more code!
What is that training data? There the company is less loquacious.
Meta said the 15 trillion tokens on which it was trained came from "publicly available sources." Which sources? Meta told The Verge's Alex Heath that they didn't include Meta user data, but didn't give much more in the way of specifics.
It did mention that it includes AI-generated data, or synthetic data: “we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3.” There are plenty of known issues with synthetic or AI-created data, foremost of which is that it can exacerbate existing issues with AI, because it’s liable to spit out a more concentrated version of any garbage it is ingesting.
AI companies are turning to such data because there’s not enough good, public data on the entire internet to train their increasingly greedy AI models. (Meta had reportedly floated buying a publisher like Simon & Schuster to satisfy its insatiable data needs.)
Meta, of course, isn't the only company that's tight-lipped about where its AI training data comes from. In a now-infamous interview with the WSJ's Joanna Stern, OpenAI's chief technology officer Mira Murati was unable to answer questions about what Sora, OpenAI's video-generating model, was trained on. YouTube? Facebook? Instagram? She said she wasn't sure.