OpenAI’s internal Slack messages could cost it billions in copyright suit
Authors and publishers suing OpenAI over copyright infringement were granted access to internal OpenAI communications about the deletion of a pirated books dataset, and now want to review attorney communications.
Among the many copyright infringement cases that AI companies are defending themselves against, one piece of evidence keeps popping up: their use of pirated book databases like LibGen to train AI models.
The plaintiffs didn’t have to look far to discover this information, as early AI company research papers often openly mentioned their use. Since those research documents began being cited as evidence in lawsuits, AI companies have become far more cautious about discussing the training materials used to build their models.
LibGen is again at the center of an AI copyright case, this time in a lawsuit filed by authors and publishers against OpenAI. Bloomberg Law reports that the plaintiffs got ahold of internal OpenAI emails and Slack messages that discussed deleting the LibGen data.
In an extraordinary move, the plaintiffs have asked the judge for access to the communications between OpenAI and its attorneys, invoking the “crime-fraud” exception to attorney-client privilege. The plaintiffs want to know whether the lawyers told OpenAI to delete the dataset, which could be construed as intentional destruction of evidence. As Bloomberg reported, if that access is allowed, and the allegation is shown to be true, OpenAI could face a finding of “willful infringement,” which could raise statutory penalties up to $150,000 per work, as well as steep sanctions from the court.
OpenAI pushed back on the allegations in a letter to the judge, adamantly denying that it had waived attorney-client privilege. In the letter, OpenAI’s attorneys wrote:
“OpenAI has consistently maintained that the reasons for the removal are privileged because they were legal decisions made in consultation with counsel. At no point has OpenAI disclosed or relied on the privileged reasons or retracted its position, and it has made clear that there are no non-privileged reasons.”
After reviewing documents in private, US Magistrate Judge Ona Wang allowed some communications to be withheld, but ordered other messages to be produced. The case remains ongoing.
It’s not the first time that a Big Tech company’s internal communications about its use of copyrighted material have shown up as evidence in a lawsuit. According to messages revealed in discovery during a copyright case, Meta’s researchers expressed reservations about using LibGen, describing it as a “data set we know to be pirated.” Per the filings, the issue was escalated to “MZ,” who approved the pirated library’s use.
In June, a federal judge in the Northern District of California ruled that Anthropic did not violate the copyright of a group of authors when it used their works to train its Claude AI model — but only for the books that the company actually purchased, scanned, and ingested. The other works Anthropic used, drawn from a pirated book dataset dubbed “The Pile,” were found not to fall under “fair use,” and the judge ordered a separate trial on those claims. In August, the company announced a $1.5 billion settlement with the authors, a figure that could grow considerably once class-action claims are calculated.