Rani Molla

Meta’s not telling where it got its AI training data

Today Meta unleashed its ChatGPT competitor, Meta AI, across its apps and as a standalone product. The company boasts that it is running on its latest, greatest AI model, Llama 3, which was trained on “data of the highest quality”! A dataset seven times larger than the one used for Llama 2! And one that includes four times more code!

What is that training data? There the company is less loquacious.

Meta said the 15 trillion tokens on which Llama 3 was trained came from “publicly available sources.” Which sources? Meta told The Verge’s Alex Heath that the data didn’t include Meta user data, but didn’t give much more in the way of specifics.

It did mention that it includes AI-generated data, or synthetic data: “we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3.” There are plenty of known issues with synthetic or AI-created data, foremost of which is that it can exacerbate existing issues with AI, because it’s liable to spit out a more concentrated version of any garbage it is ingesting.

AI companies are turning to such data because there’s not enough good, public data on the entire internet to train their increasingly greedy AI models. (Meta had reportedly floated buying a publisher like Simon & Schuster to satisfy its insatiable data needs.)

Meta, of course, isn’t the only company that’s tight-lipped about where its AI data is coming from. In a now-infamous interview with WSJ’s Joanna Stern, OpenAI’s chief technology officer Mira Murati was unable to answer questions about what Sora, OpenAI’s video-generating app, was trained on. YouTube? Facebook? Instagram? She said she wasn’t sure.

More Tech

Apple’s hardware chief is the front-runner to be the next CEO

The New York Times is the latest news organization to cite Apple sources who think the company’s hardware chief, John Ternus, will be the one to fill CEO Tim Cook’s shoes. Citing people close to Apple, the publication reports that Cook is “tired and would like to reduce his workload” and that 50-year-old Ternus is the most likely to take his place, as the company accelerates its succession planning.

The Times is in good company. Both the Financial Times and Bloomberg have previously said Ternus is the top pick to succeed Cook at the helm of the tech giant, and Ternus holds the top spot on prediction markets. His market-implied odds of being the next CEO are currently above 60% on both Polymarket and Kalshi event contracts.

Morgan Stanley: Even with Nvidia’s autonomous tech, Tesla is still “years ahead” of other automakers

Nvidia’s latest autonomous tech may help traditional automakers close the distance to manufacturing driverless cars, but not to Tesla, a research note from Morgan Stanley contends. Analyst Andrew Percoco argued that while Nvidia’s tech stack offers a “capital efficient on ramp to advanced autonomy,” that still leaves automakers stuck in a “faster follower strategy.”

According to the analyst, “Tesla is years ahead of competitors when it comes to autonomy with a clear data and scale advantage.” The comment is similar to something Tesla CEO Elon Musk said in the wake of Nvidia’s announcements:

“This is maybe a competitive pressure on Tesla in 5 or 6 years, but probably longer,” Musk posted on X.

Sherwood Media, LLC produces fresh and unique perspectives on topical financial news and is a fully owned subsidiary of Robinhood Markets, Inc., and any views expressed here do not necessarily reflect the views of any other Robinhood affiliate, including Robinhood Markets, Inc., Robinhood Financial LLC, Robinhood Securities, LLC, Robinhood Crypto, LLC, or Robinhood Money, LLC.