Scraping off scrapers… The cat-and-mouse game of AI bots and bot blockers is getting more complicated. AI web crawlers — bots that scan and scrape content to train genAI products — are popping up faster than websites can block them. Thousands of sites have blocked crawlers from companies including OpenAI, Anthropic, and Nvidia. But as new bots roll out, it’s become nearly impossible to stop them from hoovering up content without permission.
Whac-a-bot: Media giant Condé Nast has blocked two of Anthropic’s crawlers, but its newer third bot isn’t blocked. And the crawlers keep comin’: Meta and Apple recently unleashed new crawlers.
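Why a new bot slips through: one common way sites ask crawlers to stay out is a robots.txt file that lists bots by name, so a crawler with a fresh name isn't covered until someone adds it. Here's a minimal sketch using Python's standard-library parser. The bot names, the rules, and the URL are all made up for illustration and aren't taken from any real site's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two older crawlers are named and blocked.
# These bot names are placeholders, not real crawler user agents.
ROBOTS_TXT = """\
User-agent: ExampleBot-Old1
Disallow: /

User-agent: ExampleBot-Old2
Disallow: /
"""


def check_agents(robots_txt, agents, url="https://example.com/story"):
    """Return {agent: True/False} for whether the rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["ExampleBot-Old1", "ExampleBot-Old2", "ExampleBot-New"]
    for agent, allowed in check_agents(ROBOTS_TXT, agents).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

In this sketch the two named bots come back blocked, while the unnamed newcomer comes back allowed because no rule mentions it. Keep in mind that robots.txt is a request, not a lock: it only works if the crawler chooses to honor it.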
Crawlin’ costs: By “visiting” sites millions of times, bots can also saddle website hosts with big fees. One coding site said AI scrapers accessed 73 terabytes’ worth of its files in May, costing it $5K+ in bandwidth overage charges.
Bot blocking is booming… because sites don’t want their content used without permission. About 90% of top news outlets block AI crawlers, and Reddit has blocked nearly all web crawlers except for Google’s (this year Reddit agreed to a $60M deal to make its content available for Google’s AI training). Bot-blocking help has become a selling point too: Cloudflare, which just released an “easy button” for blocking AI bots, said 85% of its cloud customers block scrapers.
No.thx: Block attempts aren’t guaranteed to work. Crawlers from OpenAI, Anthropic, and Perplexity AI have all been accused of bypassing efforts to block them.
If you can’t beat ’em, charge ’em… Some media companies have avoided blocking headaches by striking content-licensing deals with AI companies. This year Wall Street Journal owner News Corp. signed a $250M deal with OpenAI, which also signed a smaller ~$25M pact with Business Insider publisher Axel Springer. Others have opted to fight: last year The New York Times slapped OpenAI and Microsoft with a copyright-infringement lawsuit.