OpenAI May Be Compelled to Explain Deletion of Pirated Book Datasets

Key Points

  • OpenAI deleted two internal datasets built from Library Genesis content before ChatGPT’s 2022 release.
  • Authors allege the datasets were used to train ChatGPT without permission, prompting a class‑action lawsuit.
  • OpenAI first cited “non‑use” as the reason for deletion, then claimed the reason is protected by attorney‑client privilege.
  • U.S. District Judge Ona Wang ordered OpenAI to disclose internal communications about the deletion.
  • The case may set precedent for how AI companies handle privileged communications in copyright litigation.

Background

OpenAI created two internal datasets, known as “Books 1” and “Books 2,” in 2021. The datasets were assembled by scraping the open web and incorporating material from Library Genesis, a well‑known shadow library that hosts pirated books. OpenAI later deleted the datasets before the public release of ChatGPT in 2022.

Legal Developments

Authors have filed a class‑action lawsuit claiming that OpenAI illegally used their copyrighted works to train ChatGPT. The plaintiffs are seeking discovery into why OpenAI deleted the datasets, arguing that the reason for the deletion could be pivotal to their case. OpenAI initially asserted that the datasets were removed because they were no longer in use, but subsequently argued that any reason for the deletion, including “non‑use,” is shielded by attorney‑client privilege.

U.S. District Judge Ona Wang ordered OpenAI to turn over all communications with in‑house counsel concerning the deletion, as well as any internal references to Library Genesis that the company may have redacted or withheld under the privilege claim. The judge noted that OpenAI’s shifting positions, first offering “non‑use” as an ordinary business explanation and later recasting that same reason as privileged legal advice, raised concerns about the company’s transparency.

Implications

If the court requires OpenAI to disclose its internal discussions, the authors could gain insight into the company’s decision‑making process and potentially strengthen their claims that the training data violated copyright law. The outcome may also set a precedent for how technology firms handle privileged communications when faced with litigation over data usage.

OpenAI’s handling of the situation reflects a broader tension between rapid AI development and adherence to intellectual‑property rights. The case highlights the legal challenges that arise when large‑scale language models are trained on publicly scraped content that may include copyrighted material.

Source: arstechnica.com