"Shadow libraries" are at the heart of the mounting copyright lawsuits against OpenAI

ChatGPT could be trained on massive datasets of books that skirt copyright laws

By Michelle Cheng
Updated July 10, 2023

Comedian and author Sarah Silverman is one of three writers to file a class-action lawsuit against the technology company OpenAI, the creator of ChatGPT, for copyright infringement. The writers also sued Meta, which has its own large language model called LLaMA, for training on their content without permission.

In the lawsuit, the plaintiffs allege that they “did not consent to the use of their copyrighted books as training material for ChatGPT,” claiming the texts were “ingested and used to train” the artificial intelligence chatbot.

To generate responses that sound like a human wrote them, AI bots are trained on vast amounts of data collected from the internet. But OpenAI is opaque about what source texts it uses to train its models, citing “the competitive landscape and the safety implications” of large-scale models like GPT-4.

Many types of materials are used to train large language models, and books are a key part of the training datasets because they offer lengthy examples of high-quality writing. But according to Silverman’s lawsuit, most of the book data comes from OpenAI training on “illegal shadow libraries” that contain the writers’ work.

Under the hood of OpenAI’s book training data

So, what do we know about how ChatGPT is trained? OpenAI has said that 15% of the training set for GPT-3, the language model currently being used for the free version of the AI bot, comes from “two internet-based books corpora” that the company simply calls “Books1” and “Books2,” according to the lawsuit.

However, there are clues about these two data sets. “Books1” is linked to Project Gutenberg (an online e-book library with over 60,000 titles), a popular dataset for AI researchers to train their models on because its books are not under copyright, the filing states. “Books2” is estimated to contain about 294,000 titles, it notes.

Most of the “internet-based books corpora” is likely to come from shadow library websites such as Library Genesis, Z-Library, Sci-Hub, and Bibliotik. The books aggregated by these sites are available in bulk via torrent websites, which are known for hosting copyrighted materials.

What exactly are shadow libraries?

Shadow libraries are online databases that provide access to millions of books and articles that are out of print, hard to obtain, or paywalled. Many of these databases, which began appearing online around 2008, originated in Russia, which has a long tradition of sharing forbidden books, according to the magazine Reason.

Soon enough, these libraries became popular with cash-strapped academics around the world thanks to the high cost of accessing scholarly journals—with some reportedly going for as much as $500 for an entirely open-access article.

These shadow libraries are also called “pirate libraries” because they often infringe on copyrighted work and cut into the publishing industry’s profits. A 2017 Nielsen and Digimarc study found that pirated books were “depressing legitimate book sales by as much as 14%.”

Governments around the world have cracked down on shadow libraries. Last October, the FBI seized several websites associated with Z-Library and charged two Russian nationals with criminal copyright infringement, wire fraud, and money laundering. But after the US government took down one of the site’s main online locations, others created mirrors of the site, as Vice reported. Courts in France and India have also ordered internet service providers to block Z-Library.

Solutions to handling the training of copyrighted content

Silverman isn’t alone in suing generative AI companies. Earlier this year, a group of visual artists sued Stability AI, Midjourney, and DeviantArt for copyright infringement. Last November, GitHub programmers filed a class-action lawsuit against GitHub, its parent company Microsoft Corp., and OpenAI, which counts Microsoft as a major investor. The lawsuit alleges that GitHub Copilot, an AI product, relies on “unprecedented open-source software piracy.”

In response to the growing lawsuits, Pau Garcia, the founder of Domestic Data Streamers, an art consulting firm, wrote in a LinkedIn post in January that AI companies should shift their training models to only use the material in the public domain or remove the artist’s work from the models. Companies can pay artists outright to use their content for training data, Garcia added.

Firms are also toying with letting artists have a say over what content AI models can be trained on. In May, music streaming platform Audius launched a new feature allowing artists to create a page for their work that anyone can use for AI-generated tracks.
