Does AI Steal Books? How AI Is Being Trained with Pirated Books


The rapid spread and evolution of large language models such as ChatGPT has taken nearly every industry across the world by storm. And as these AI chatbots are shaking up the way that we work and do business, regulatory bodies are only now trying to catch up. The book publishing industry is certainly not immune to these trends, and there have been more and more conversations about the intersection between book publishing and AI.

Many of these tough conversations have centered on whether authors should be allowed to leverage AI as they are composing their books or aspects of their books, but today I want to talk about a related but different issue that emerged in recent weeks: copyrighted books being used to train these AI chatbots.

I'm going to go over how AI has managed to steal all these books, how authors are responding to this issue, and what you should do if you are an established or aspiring author. It is, admittedly, a really strange time for book writing and publishing, so I want to help you navigate all the aspects of it — including how AI is changing things up for us.

How AI Is Stealing and Training on Copyrighted Books

So, how is AI stealing books? You likely know that large language models like ChatGPT require an immense amount of text data to learn how to respond to users’ prompts. Recently, it came to light that many of these AI chatbots were trained on data sets that include a large repository of pirated books. This means these books were put on the internet in full, for free, without the author's knowledge or consent, rather than readers having to pay for them.

These works were under copyright and were not supposed to be publicly available, and as I mentioned, authors did not agree at any point to have their books help train AI chatbots. The information or stories contained in these books were then used to inform how the AI responds. The AI can even mimic an author's style or just outright take their ideas, not necessarily even giving them credit. If you've played around with ChatGPT, you know that it's not citing any sources in its responses.

One particular repository of pirated books that came to light in recent weeks is called “Books3,” and it included somewhere between 150,000 and 200,000 pirated books, both fiction and non-fiction. It was part of an open-source data project designed to help compile data to train AI models, but an anti-piracy group called the Rights Alliance discovered that it contained pirated books and called for the data set to be taken down. Since then, Books3 has been taken down, but it's likely that copies of this data set are still out there being used to train AI chatbots.

How Authors Are Responding to AI Copyright Infringement Issues

Next, I want to talk about how authors have responded to this discovery of their books being used to train AI and what they're doing about it. As you can imagine, when authors discovered that their books were being used in this way, many of them were not happy.

In fact, more than 10,000 authors signed an open letter, put together by the Authors Guild, to the CEOs of major AI companies. The recipients include the CEOs of companies like Meta, OpenAI and Microsoft, and the authors who signed the letter include Margaret Atwood, Roxane Gay, Dan Brown and James Patterson.

The authors asked for three specific actions to be taken relative to AI’s use of their work:

  1. Obtain permission for use of our copyrighted material in your generative AI programs

  2. Compensate writers fairly for the past and ongoing use of our works in your generative AI programs

  3. Compensate writers fairly for the use of our works in AI output, whether or not the outputs are infringing under current law

Time will tell how people in leadership positions at these AI companies respond to this letter and deal with the issue of pirated works being used in the AI data sets, but it's very clear that this is a pressing issue for the publishing industry and something that is top of mind for authors right now.  

What Should Authors Do to Protect Their Book from Pirated AI Data Sets?

So, if you're an established or an aspiring author and you're concerned about the potential of your book being used in AI capacities, what should you do? This is such a weird landscape to be in right now.  

First, if you are in the position of having a literary agent and/or a publisher that you are working with, I recommend having a frank, open, honest conversation with them: Are there any ways to guarantee protections in any of the contracts you have now or any future contracts that you sign relative to how AI could factor into your book's publication?

More broadly, you can also advocate for more legal protections surrounding creative copyrighted work. It's important to note that the U.S. Copyright Office has confirmed its position that human authorship is necessary for a work to receive copyright protection.

Something else I recommend doing is just staying up to speed on how AI is evolving and what use cases are emerging for these large language models. Many authors want to reject AI entirely, and I totally understand why. You want to focus on telling your story from your own mind. Maybe you even want to pretend that AI doesn't exist at all; you just want to dissociate from it. But the truth is that you can't protect yourself or your work if you just don't know what it is or how it works. So, even if you don't want to use it in any capacity, at least make sure you understand what's going on.

And lastly, know your worth as an author and as a creator. Don't let your creative work be taken advantage of. The authors who signed that open letter to the AI companies emphasized that this technology was only made possible because of their work; human creativity is the reason this technology exists at all.

Even though there's obviously an issue with how AI training is being handled right now, I hope this at least assures you of your value as an author and that the value of the stories you are bringing into this world is not to be discounted and is not going away — even as AI becomes more sophisticated.

I hope this gave you some more insight into these ongoing issues involving AI and book publishing, especially about this issue of large language models being trained on pirated books.

Thanks so much for reading and happy writing!
