Date: 2024-07-16

With the generative artificial intelligence boom underway, tech companies are looking for training data to improve their models, and some are taking it without permission.

Apple, Nvidia, and Anthropic are among the tech companies found to have trained AI models with subtitles from tens of thousands of YouTube videos despite the platform’s rules against downloading and using its content without permission, according to an investigation by Proof News that was co-published with Wired.

The investigation found that the companies were using a dataset called YouTube Subtitles that included transcripts of 173,536 YouTube videos from over 48,000 channels. Videos in the dataset range from educational channels such as Khan Academy and MIT, to news outlets including The Wall Street Journal, to some of the platform’s top creators like MrBeast and Marques Brownlee.

“Apple has sourced data for their AI from several companies,” Brownlee wrote in a post on X addressing the investigation. “One of them scraped tons of data/transcripts from YouTube videos, including mine.”

Brownlee added that while “Apple technically avoids ‘fault’ here because they’re not the ones scraping,” “this is going to be an evolving problem for a long time.”

Proof News also created a tool for creators to search for their content in the dataset, which included a handful of videos from Quartz. The YouTube Subtitles dataset does not include imagery from videos, but does include some translated subtitles in languages such as German and Arabic.

The dataset was created by EleutherAI, “a non-profit AI research lab” that is focused on “promoting open science norms.” It is part of the Pile, the nonprofit’s compilation of material from other sources, including the European Parliament and English Wikipedia, according to Proof News.

“The Pile dataset referred to in the research paper was trained in 2021 for academic and research purposes,” a spokesperson for Salesforce, one of the companies named in the investigation for using the dataset, said in a statement shared with Quartz. “The dataset was publicly available and released under a permissive license.”

Neither Apple, Nvidia, nor Anthropic immediately responded to a request for comment.

In April, YouTube chief executive Neal Mohan told Bloomberg that companies using YouTube videos, including transcripts or video bits, to train AI models such as OpenAI’s text-to-video generator, Sora, would be a “clear violation” of the platform’s policies. However, the New York Times reported days later that OpenAI had transcribed over a million hours of YouTube videos to train its GPT-4 model.