Every author has a version of the same ritual. You’re writing a scene set in a city you visited years ago, and you stop mid-sentence to pull a travel guide off the shelf. Or you flip back three chapters to double-check what color you made the protagonist’s eyes. Or you open a browser tab to confirm whether gas streetlamps were common in 1880s London. You don’t try to write from memory alone, because memory gets things wrong.
AI models, left to their own devices, do try to write from memory alone. Everything a large language model knows was absorbed from its training data and frozen into its weights at whatever point training stopped. Ask it about your characters, your world, or your plot, and it has nothing to check. So it guesses. Sometimes brilliantly, sometimes not. This is one of the main reasons AI hallucinates: it has no reference shelf to consult.
Retrieval-Augmented Generation, mercifully shortened to RAG, gives the AI a reference shelf.
What RAG Actually Does
The name breaks down into three pieces, and each one earns its place.
Retrieval: Before generating a response, the system searches through a collection of documents (your manuscript, a world bible, uploaded research files) and pulls out the passages most relevant to your question or prompt.
Augmented: Those retrieved passages get added to the prompt the AI actually sees. The model isn’t just working from its training anymore; it’s working from your specific material.
Generation: The AI produces its response with both its general knowledge and your retrieved documents in view.
The analogy that fits best for authors: RAG turns an AI from a writer working entirely from memory into a writer who keeps your notes open on the desk beside them.
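The retrieve-augment-generate loop can be sketched in a few lines. Everything below is illustrative: the keyword-overlap retriever is a toy stand-in for the semantic search real tools use, and the world-bible entries are invented.

```python
import re

# A toy sketch of retrieve -> augment -> generate. The keyword-overlap
# retriever is a stand-in for real semantic search; the world-bible
# entries are invented for illustration.

def words(text):
    """Lowercase words in a string, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, documents, k=2):
    """Return the k documents sharing the most words with the question."""
    q = words(question)
    return sorted(documents, key=lambda d: len(q & words(d)), reverse=True)[:k]

def augment(question, passages):
    """Build the prompt the model actually sees: notes first, then the question."""
    notes = "\n".join(f"- {p}" for p in passages)
    return f"Reference notes:\n{notes}\n\nQuestion: {question}"

world_bible = [
    "Mara Voss: the protagonist, green eyes, scarred left hand.",
    "The Lantern District: gas streetlamps, fog, cobbled lanes.",
    "House Corvain: rival family, sigil is a silver crow.",
]

question = "What color are Mara's eyes?"
prompt = augment(question, retrieve(question, world_bible))
# The generation step is simply sending `prompt` to the model.
```

The point of the sketch is the shape of the pipeline, not the retriever: the model never sees the whole world bible, only the passages that scored highest for this particular question.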
The Accidental Acronym
In May 2020, a team of researchers led by Patrick Lewis published a paper called “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Lewis was a PhD student at University College London, simultaneously working at Facebook AI Research’s London lab. The paper, whose dozen authors spanned Facebook AI, UCL, and NYU, was accepted at NeurIPS, one of the most prestigious machine learning conferences in the world.
The problem they were solving was fundamental. Large language models store everything they know inside their parameters (the billions of numerical weights adjusted during training). That knowledge is static, impossible to update without retraining, and impossible to verify. When a model gets a fact wrong, you can’t point to the error and fix it, because the knowledge is smeared across billions of interconnected numbers.
Lewis and his team proposed a surprisingly elegant solution: give the model two kinds of memory. Parametric memory, the knowledge baked into its weights during training. And non-parametric memory, an external knowledge base (in their experiments, all of Wikipedia) that the model could search on the fly. The system would retrieve relevant passages before generating each answer, grounding its output in real documents rather than statistical guesswork.
The results were dramatic. RAG beat the previous state of the art on multiple question-answering benchmarks, improving accuracy by as much as eight percentage points. But perhaps the more lasting contribution was the idea itself, which turned out to be wildly generalizable. Any document collection could serve as the external memory: legal filings, medical records, corporate databases, or (as authors would eventually discover) fictional world bibles.
As for the name, Lewis has publicly admitted he wishes they’d thought harder about it. “We definitely would have put more thought into the name had we known our work would become so widespread,” he said in a later interview. The team had always planned to come up with something better, but when the submission deadline arrived, nobody had a replacement. So RAG it was, and RAG it remains.
How It Works Under the Hood
The mechanics are more intuitive than they sound.
First, your documents get broken into smaller chunks (paragraphs, roughly) and converted into numerical representations called vector embeddings: lists of numbers that capture the meaning of each chunk. These embeddings get stored in a searchable index.
When you type a prompt, your question gets converted into the same kind of embedding. The system then compares your question’s embedding against all the stored chunks, looking for the closest semantic matches. Think of it as a librarian who reads your question and immediately knows which three pages in which three books to pull.
Those matching passages get inserted into the context window alongside your original prompt. The AI generates its response with your specific reference material right in front of it.
This is different from simply pasting your entire manuscript into a chat window. RAG systems do the selection automatically, surfacing only what’s relevant instead of flooding the model with everything. That’s what makes it scale: your knowledge base can be enormous, but the AI only sees the slices it needs for each request.
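The chunk-embed-search loop can be made concrete with hand-made vectors. A real system would get embeddings from a model and store thousands of chunks in a vector index; the three-dimensional vectors below are invented purely so the similarity arithmetic is visible.

```python
import math

def cosine(a, b):
    """Similarity of two vectors: near 1.0 means same direction (same meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Step 1: chunks from the manuscript, each stored with its embedding.
# (The dimensions are made up: [character detail, setting detail, plot detail].)
index = [
    ("Mara's eyes are green.",                 [0.9, 0.1, 0.0]),
    ("The Lantern District burns gas lamps.",  [0.1, 0.9, 0.1]),
    ("House Corvain plots against the crown.", [0.1, 0.1, 0.9]),
]

# Step 2: embed the question the same way, then rank chunks by similarity.
question_vec = [0.8, 0.2, 0.1]  # pretend embedding of "What do Mara's eyes look like?"
best = max(index, key=lambda item: cosine(question_vec, item[1]))

# Step 3: only the winning chunk gets inserted into the model's context.
print(best[0])  # → Mara's eyes are green.
```

Notice that nothing here matches keywords: the question and the chunk are compared as directions in a shared space, which is why retrieval can find the right passage even when the wording differs.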
Why This Matters for Your Writing Life
If you’ve used any AI writing tool that lets you upload documents or build a knowledge base, you’ve already used RAG (or something very close to it). Understanding what’s happening behind the scenes helps you use these tools more effectively.
NovelCrafter’s Codex is one of the clearest examples. When you build entries for characters, locations, and lore, NovelCrafter doesn’t dump everything into every prompt. It detects which elements are relevant to the scene you’re working on and retrieves only those entries. Write a scene featuring your antagonist in the library, and NovelCrafter pulls the antagonist’s personality notes and the library’s description into the AI’s context, leaving everything else on the shelf. The better you write your Codex entries, the better the retrieval works.
Claude’s Projects feature uses RAG explicitly. Upload your manuscript, style guide, and research documents to a project, and Claude retrieves relevant sections as needed rather than trying to hold everything in memory at once. As your project grows, the system automatically switches to retrieval mode to handle the volume.
ChatGPT’s Custom GPTs work the same way. Enable file search and upload reference documents, and the system chunks, indexes, and retrieves relevant passages when you ask questions. This is why “chat with your PDF” tools exist as an entire product category: they’re all RAG applications under the hood.
NotebookLM, Google’s research tool, takes the concept even further by refusing to answer from anything except your uploaded sources. It’s RAG in its purest form, and it’s remarkably useful for authors doing deep research who want an AI that won’t wander beyond their source material.
The practical takeaway is this: the quality of what you put in directly shapes the quality of what comes out. Well-organized character sheets, clearly written world-building documents, and focused research files give the retrieval system better material to find. Messy, contradictory, or vague source documents mean the AI retrieves messy, contradictory, or vague context. RAG is powerful, but it’s only as good as the library you build for it.
Understanding RAG also clarifies the difference between two popular approaches to customizing AI. RAG gives the model reference material to consult (like bringing notes to an open-book exam), while fine-tuning changes the model itself (like studying until the material becomes second nature). Most author-facing tools use RAG because it’s faster, cheaper, and lets you update your knowledge base anytime without retraining anything. That’s worth knowing the next time a tool asks you to upload your world bible or build a character database. It’s not just storing your files. It’s building the AI a better reference shelf.