Training Data

Somewhere in the sprawling dataset that taught ChatGPT to write, there are romance novels. Thousands of them. There are also literary award winners, self-published thrillers, poetry collections, children’s books, technical manuals, and fanfiction that probably should have stayed in someone’s drafts folder. All of it, every paragraph and plot twist, became raw material for a machine that was learning what language looks like.

If you’ve ever asked an AI writing tool to draft a scene and felt a flicker of recognition in what came back, like it had somehow read the same books you had, that’s not a coincidence. It probably did.

What Training Data Actually Is

Training data is the material an AI model studies in order to learn. That’s it, at its core. Just as a student needs textbooks, lectures, and practice problems to develop understanding, an AI model needs examples to learn patterns. Feed it enough examples of how language works, and it starts picking up the rules on its own: grammar, sentence structure, the way a mystery novel builds tension differently than a self-help book.

For large language models like GPT-4 or Claude, training data means text. Enormous quantities of text. Web pages, digitized books, Wikipedia articles, news stories, forum posts, academic papers, and code repositories, all collected, cleaned, and fed into the model so it can learn to predict what word comes next in a sequence. The model doesn’t memorize the data (at least, that’s the goal). Instead, it extracts patterns and probabilities, building an internal sense of how language flows.

Think of it this way: if you read a thousand mystery novels, you wouldn’t memorize every sentence, but you’d develop an instinct for pacing, red herrings, and the rhythm of a reveal. Training data gives an AI that same kind of instinct, except the AI reads millions of texts and processes them mathematically rather than intuitively.

From Checkerboards to Common Crawl

The idea that machines can learn from examples is as old as machine learning itself. In 1952, an IBM researcher named Arthur Samuel built a program that played checkers. The program wasn’t given a strategy. Instead, Samuel fed it records of expert games and had it play thousands of matches against itself. Each game was data the program could learn from: which moves led to wins, which led to losses, which board positions were promising. By 1959, when Samuel published his landmark paper “Some Studies in Machine Learning Using the Game of Checkers,” the program could beat its own creator. Those game records were, in the modern sense, its training data.

For decades, training data remained relatively small and carefully curated. Researchers would assemble hundreds or thousands of labeled examples for specific tasks: spam vs. not-spam, positive sentiment vs. negative sentiment, cat vs. dog. The data was modest because the models were modest.

Then the scale changed.

When OpenAI trained GPT-3 in 2020, the model consumed roughly 570 gigabytes of text, drawn from Common Crawl (a massive filtered snapshot of the internet), a curated web text corpus called WebText2, two undisclosed book datasets known only as “Books1” and “Books2,” and English Wikipedia. By the time GPT-4 arrived in 2023, estimates put its training data at around one petabyte, equivalent to one million gigabytes. To put that in perspective, if you read one book a day for 80 years, you’d get through about 29,000 books. GPT-4’s training data likely included millions.
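Those scale claims hold up to back-of-envelope arithmetic. The sketch below assumes roughly 500 kilobytes of plain text per book, a round illustrative figure (real books vary widely), just to show the gap between a human reading life and one training corpus:

```python
# Back-of-envelope scale check. The 500 KB-per-book figure is a round
# illustrative assumption, not a measured value.
books_per_lifetime = 365 * 80  # one book a day for 80 years
kb_per_book = 500
gpt3_corpus_kb = 570 * 1_000_000  # 570 GB expressed in kilobytes
book_equivalents_in_gpt3 = gpt3_corpus_kb // kb_per_book

print(books_per_lifetime)        # 29200 -- "about 29,000 books"
print(book_equivalents_in_gpt3)  # 1140000 -- over a million book-equivalents
```

Even GPT-3's comparatively small corpus works out to more than a million book-equivalents of text, which is why "millions" for GPT-4 is a conservative guess.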

The approach shifted, too. Instead of carefully labeling every example (“this sentence is positive, this one is negative”), modern LLMs learn from unlabeled text through a process called self-supervised learning. The model reads a sentence, hides the last word, tries to predict it, checks whether it was right, and adjusts. Billions of times. The training data doesn’t need to be organized or annotated. It just needs to be vast.
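That predict-and-adjust loop can be caricatured in a few lines of code. The sketch below is a deliberately tiny stand-in: it "learns" next-word predictions by counting word pairs in an invented three-sentence corpus, where a real LLM uses a neural network and gradient descent over billions of examples. The point is only that prediction emerges from raw, unlabeled text:

```python
from collections import Counter, defaultdict

# Toy next-word predictor. The three-sentence "corpus" is invented, and
# counting word pairs stands in for the neural network a real LLM trains.
corpus = [
    "the detective opened the door",
    "the detective searched the room",
    "the butler opened the window",
]

# "Training": read each sentence and count which word follows which.
follower_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        follower_counts[current_word][next_word] += 1

def predict_next(word):
    """Guess the next word: the most frequent follower seen in training."""
    followers = follower_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

# "the" was followed by "detective" twice and by four other words once
# each, so the best guess after "the" is "detective".
print(predict_next("the"))  # detective
```

Nobody labeled anything here; the text itself supplied both the question (the words so far) and the answer (the word that actually came next). Self-supervised learning is that trick, scaled up astronomically.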

The Part Where Your Books Come In

And this is where the story gets personal for authors.

In 2020, a dataset called Books3 appeared online, assembled by a group of open-source AI researchers. It contained 196,640 books in plain text, compiled from a shadow library (the pirate torrent tracker Bibliotik, a cousin of sites like Library Genesis and Z-Library). Fiction, nonfiction, bestsellers, backlist titles, all digitized and offered up as training data for anyone building a language model. Meta used it to train LLaMA. Other companies used datasets just like it.

The discovery of Books3 set off a chain of lawsuits that reshaped the conversation about AI and intellectual property. In 2023, authors including Sarah Silverman and Michael Chabon filed suits against OpenAI, alleging their books had been used as training data without permission. The Authors Guild filed a separate class action. More suits followed against Meta, Anthropic, and others. By 2025, over 70 copyright infringement cases had been filed against AI companies.

The most significant ruling came in June 2025, when a federal judge in the Bartz v. Anthropic case described AI training as “transformative, spectacularly so,” suggesting that the act of learning from copyrighted text may qualify as fair use. But the judge drew a line: Anthropic’s downloading of pirated books from shadow libraries was not fair use. Anthropic settled for approximately $1.5 billion, paying roughly $3,000 for each of the 482,460 books it had downloaded.

The legal landscape is still shifting. The U.S. Copyright Office released a 108-page report in May 2025 arguing that certain uses of copyrighted material for AI training cannot be defended as fair use. Courts in different jurisdictions are reaching different conclusions. For authors, the question isn’t abstract. The text that trained the AI tools you use almost certainly includes published books, and the question of whether that usage was legal, ethical, or fair is one the courts are still answering.

Why This Matters for Your Writing Life

Understanding training data isn’t just about the copyright debate, though that matters. It’s about understanding why AI tools behave the way they do.

Training data explains genre competence. When an AI writing tool produces surprisingly good fantasy prose but stumbles with literary fiction (or vice versa), the difference often comes down to what it trained on. A model that absorbed more genre fiction during training will have stronger instincts for genre conventions. This is also why fine-tuning matters: tools like Sudowrite’s Muse model are specifically trained on creative writing data with author consent, and the difference in output quality compared to a general-purpose model is noticeable.

It explains hallucination. When an AI invents a book that doesn’t exist or attributes a quote to the wrong person, the root cause is training data. The model learned patterns of what citations, book titles, and attributions look like, but it never learned which specific ones are real. It’s pattern-matching, not fact-checking.

It shapes what the AI knows and doesn’t know. Most training data for major LLMs has a cutoff date, meaning the model has no knowledge of events that happened after its training data was collected. If you ask ChatGPT about a book published last month, it won’t know about it, not because it’s broken, but because that book wasn’t in the data it learned from.

Your own data matters, too. When you paste your manuscript into an AI tool, you’re creating a form of real-time training data (though most reputable tools, including Sudowrite and NovelCrafter, explicitly promise they don’t use your writing to train their models). Still, the text you provide in your prompts shapes the output of that specific session. Give the AI three chapters of your novel as context, and it will generate prose that echoes your voice, because you’ve given it examples to learn from in the moment. That’s training data in miniature.
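Under the hood, that "training data in miniature" is just prompt assembly. Here is a minimal, hypothetical sketch; the function name and chapter text are invented for illustration, and real tools add token-limit handling, formatting, and system instructions on top:

```python
# Hypothetical sketch of how a writing tool might assemble a prompt.
# The function and its inputs are illustrative, not any real tool's API.
def build_prompt(chapters, instruction):
    """Stitch manuscript excerpts and an instruction into one prompt,
    so the model can pick up the author's voice from the examples."""
    context = "\n\n".join(
        f"[Chapter {i}]\n{text}" for i, text in enumerate(chapters, start=1)
    )
    return f"{context}\n\n{instruction}"

prompt = build_prompt(
    ["Rain hammered the tin roof all night.",
     "By morning the river had climbed the porch steps."],
    "Continue the scene in the same voice.",
)
print(prompt.startswith("[Chapter 1]"))  # True
```

The model never permanently learns from this text (assuming the tool keeps its no-training promise); the chapters simply sit in front of the instruction, steering that one response toward your voice.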

The next time you open an AI writing tool and it produces a sentence that lands just right, remember: that capability didn’t come from nowhere. It came from data. Millions of texts written by millions of people, processed into patterns and probabilities. Some of those people were authors like you, and the conversation about what that means, for copyright, for creativity, for the future of publishing, is just getting started.