AI's Dirty Little Secret: Models Fueled by Pirated Books

The Atlantic just released a major investigative piece showing that popular large language models, like Meta’s LLaMA, have been trained on pirated books.

Why it matters:

This raises serious copyright concerns around how large language models have been trained.

Says the article:

“Upwards of 170,000 books, the majority published in the past 20 years, are in LLaMA’s training data. . . . These books are part of a dataset called “Books3,” and its use has not been limited to LLaMA. Books3 was also used to train Bloomberg’s BloombergGPT, EleutherAI’s GPT-J—a popular open-source model—and likely other generative-AI programs now embedded in websites across the internet.”

According to an interview in the story with the creator of the Books3 dataset, Books3 appears to have been created with altruistic intentions. The developer behind it said he built the dataset to give independent developers “OpenAI-grade training data,” fearing that large AI companies would otherwise have a monopoly over generative AI tools.

Connecting the dots:

In Episode 61 of the Marketing AI Show, Marketing AI Institute founder/CEO Paul Roetzer broke down what we can expect to happen next.

  1. AI companies may try to rely on “fair use” arguments to justify this. Fair use doctrine in U.S. copyright law states that sometimes copyrighted material may be used if it meets certain criteria—including how it’s used and how much material is used. It’s unclear if this is a justifiable strategy in the case of using copyrighted material to train AI models.
  2. But AI companies are now on notice. “It seems like, if nothing else, these companies were very aggressive in using stuff that might not have been allowed to be used,” says Roetzer. He doesn’t see that continuing: AI companies now know they’re being watched closely in this respect, and that future laws and regulations may catch up to them.
  3. The likely way forward is licensing deals. “I assume that the play moving forward is to try and license the best examples of writing possible, including books,” says Roetzer. It’s not viable for AI companies to keep trampling on copyright, especially with lawsuits pending, so they may move more aggressively to license content from trusted publishers. For instance, OpenAI and the New York Times are attempting to reach such a licensing deal. This would give models high-quality content on which to train—without breaking the law.
