The copyright battles against OpenAI have begun

Two authors are suing OpenAI, claiming that ChatGPT has unlawfully digested their books as part of its training data

Illustration: Dado Ruvic (Reuters)

Two novelists, Paul Tremblay and Mona Awad, have filed a lawsuit against OpenAI in a San Francisco federal court, alleging that its ChatGPT large language model was trained using data from their copyrighted books without consent.

Tremblay, the author of “The Cabin at the End of the World,” and Awad, the author of “Bunny” and “13 Ways of Looking at a Fat Girl,” claim in their 16-page class action suit (pdf) dated June 28 that ChatGPT generates very accurate summaries of their literary works when prompted.


The writers said this is “only possible” if ChatGPT was trained on the content in their books, which would amount to a breach of federal copyright law. As a result, they add, OpenAI stands to “benefit commercial[ly] and profit richly” from the use of their copyrighted materials. Andres Guadamuz, an intellectual property scholar at the University of Sussex, told the Guardian that this is the first copyright-related legal claim against OpenAI. But it is very unlikely to be the last.

Is OpenAI getting in trouble for using copyrighted material?

The authors’ complaint cites a June 2018 paper in which OpenAI revealed that it trained its GPT-1 model on BookCorpus, “a collection of over 7,000 unique unpublished books from a variety of genres including adventure, fantasy, and romance.”


In its July 2020 paper (pdf) introducing GPT-3, OpenAI disclosed that 15% of its training dataset came from “two internet-based books corpora” that OpenAI simply called “Books1” and “Books2.” Books1, the authors said in their complaint, is about nine times larger than BookCorpus, while Books2 is 42 times bigger. Taking BookCorpus’s roughly 7,000 titles as the baseline, that works out to about 63,000 books in Books1 and 294,000 in Books2, so these two sets of data alone would contain more than 350,000 books.

Since the launch of ChatGPT last November, OpenAI has not revealed precisely what data it used to train the bot, or where that data came from. In its 2020 paper, OpenAI merely said that most of the training data was scraped from the web, including archived books and Wikipedia.

Let the AI copyright battles begin

The lawsuit by Tremblay and Awad inaugurates a battle between AI companies and the copyright owners whose works are used to train large language models. It also amplifies previous demands for damages for works used without consent, despite difficulties in proving that copyright owners actually suffered financial losses from these infringements.


In January, a group of visual artists sued Stability AI, Midjourney, and DeviantArt, arguing that these AI engines used the artworks of human artists to produce images in their styles. And in May, Ashley Irwin, president of the Society of Composers and Lyricists, told a House judiciary subcommittee that the copyrights of creators had to be protected from generative AI systems.

Last November, computer programmers filed a $9 billion class action lawsuit against Microsoft, the code-sharing site GitHub, and OpenAI. The suit argued that Copilot, an AI-powered coding assistant on GitHub, uses other people’s code in a way that amounts to software piracy. Copilot was charged with infringing copyright by using lines of code written by humans without proper attribution.


With this latest lawsuit from Tremblay and Awad, regulators and courts will be tasked with mulling over the rules of copyright with regard to AI. They may require generative AI companies to disclose how and where they sourced their training data, letting the world peek inside the black box of these AI systems for the very first time.