A 3-part series on RAG. You are reading Part 1: the foundation. Part 2 is about chunking. Part 3 is about making it actually good.

Let me start with a small, slightly embarrassing story about how smart these models are not.

You can ask a modern language model almost anything. The causes of the French Revolution. How to reverse a linked list. A decent risotto recipe. It has read a staggering amount and will answer with total confidence.

Now ask it this: “What is our company’s refund policy?”

It has no idea. It cannot know. Your company handbook was never in its training data, and it never will be. So one of two things happens. The honest model says “I do not have access to that.” The overconfident one invents a refund policy that sounds perfectly reasonable and is completely made up. Neither is useful, and the second is dangerous.

This is the gap RAG was built to close. And once you see the shape of the problem, the solution is almost obvious.

The model is brilliant and blind

Here is the core tension. A language model knows a lot, but only about the world it was trained on, frozen at a moment in the past. It knows nothing about:

  • Your private documents
  • Anything that happened after its training cut-off
  • The specific, niche, true details of your particular situation

You could try to fix this by retraining the model on your data, but that is slow and expensive, and you would have to do it again every time a document changes. That is a sledgehammer for a problem that needs a scalpel.

RAG is the scalpel. The trick is to stop expecting the model to know your information, and instead hand it the right information at the moment you ask the question.

That is the whole idea, in one sentence: before the model answers, go find the relevant documents and put them in front of it. Retrieval, then generation: retrieval-augmented generation, or RAG.

A librarian, not a know-it-all

The analogy that finally made it click for me is a library.

Imagine you walk up to a reference librarian with a question. The librarian does not have every fact memorised. What they have is something better: they know how to find the right book, fast, and they will read you the relevant passage before answering.

A plain language model is the know-it-all at the party who answers everything from memory and is sometimes confidently wrong. A RAG system is the librarian. Same intelligence, but it looks things up first, and it can point you to the exact page it used.

  • You ask: "What is our refund policy?"
  • Retrieve: find the matching pages from your documents
  • Augment: paste those pages next to your question
  • Generate: the model answers using that context
The RAG loop. Retrieve, augment, generate. The model never had to "know" your policy. It just had to read the right page at the right time.
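
If it helps to see the loop as code, here is a minimal sketch of its three steps. Everything in it is an assumption made for illustration: the function names, the prompt wording, the two sample pages, and the toy_score and toy_generate stand-ins, which only exist so the sketch runs without a real model. A real system would swap in a proper relevance score and a real model call.

```python
from typing import Callable, List


def retrieve(question: str, pages: List[str],
             score: Callable[[str, str], float], k: int = 2) -> List[str]:
    """Retrieve: return the k pages that score highest against the question."""
    ranked = sorted(pages, key=lambda page: score(question, page), reverse=True)
    return ranked[:k]


def augment(question: str, context: List[str]) -> str:
    """Augment: paste the retrieved pages next to the question."""
    sources = "\n".join(f"[{i}] {page}" for i, page in enumerate(context, start=1))
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {question}"


def rag_answer(question: str, pages: List[str],
               score: Callable[[str, str], float],
               generate: Callable[[str], str], k: int = 2) -> str:
    """Generate: hand the augmented prompt to the model and return its answer."""
    return generate(augment(question, retrieve(question, pages, score, k)))


# Hypothetical stand-ins so the sketch runs without a real model.
def toy_score(question: str, page: str) -> float:
    # Placeholder relevance: a real system would compare meaning, not one word.
    return 1.0 if "refund" in page.lower() else 0.0


def toy_generate(prompt: str) -> str:
    # Placeholder for a real model call.
    return "(model output for a prompt of %d characters)" % len(prompt)


# Invented sample pages, not a real policy.
pages = [
    "Shipping takes 3 to 5 working days.",
    "Refunds are issued within 14 days of purchase.",
]
question = "What is our refund policy?"
print(augment(question, retrieve(question, pages, toy_score, k=1)))
```

The numbered sources in the prompt are what make "point you to the exact page" possible: the model can be asked to cite [1], and a human can go and check it.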

Notice what this buys you, beyond just correct answers:

  • The answer is grounded in real documents, so the model hallucinates far less.
  • You can show the source, which means a human can check it.
  • When a document changes, you change the document, and the next answer is already up to date. No retraining.

That last point is why RAG quietly became the default way to build serious AI systems on private or fast-moving data. It is practical in a way that retraining never is.

But what does “find the relevant pages” actually mean?

Here is where most explanations wave their hands, and where the real idea lives. When I say “go find the relevant documents,” how does a computer decide what is relevant?

It cannot just search for matching words. If you ask “how do I get my money back” and the document says “refund procedure,” there is not a single shared word, yet they obviously mean the same thing. Keyword search would miss it.
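
You can check the "not a single shared word" claim in a few lines. This is only a sketch of the most naive kind of keyword matching (lowercase the text, split it into words, intersect the two sets); the words helper is a name I made up for the example, and real keyword search engines are more sophisticated than this.

```python
import re


def words(text: str) -> set:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


question = "how do I get my money back"
document = "refund procedure"

shared = words(question) & words(document)
print(shared)  # set(): the two texts have no word in common
```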

So RAG uses something cleverer: it turns meaning into numbers.

Every chunk of text, your documents and your question alike, gets passed through an embedding model. This is a cousin of the language model whose only job is to convert text into a long list of numbers, a vector, arranged so that things which mean similar things end up close together. “Refund” and “money back” land near each other. “Refund” and “photosynthesis” land far apart. The text becomes a point in a kind of map of meaning.

I covered this map idea briefly in my post on how LLMs work. In RAG, it becomes the whole engine.

Closeness is the answer

Once your question and all your document chunks are points on this map, “find the relevant ones” turns into something a computer is great at: find the points nearest to my question.

The usual way to measure “near” is called cosine similarity. You do not need the maths. Just the intuition: it gives a score, where 1 means the two pieces of text point in the same direction (very similar) and a score near 0 means they are unrelated. Strictly, the score can go as low as -1, for vectors pointing in opposite directions. The system scores your question against every chunk and grabs the top few.
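
For anyone who does want to see it, cosine similarity is short enough to write by hand. The sketch below assumes plain Python lists of numbers as vectors, and that neither vector is all zeros (otherwise it would divide by zero). The tiny two-number vectors are made up so the arithmetic is easy to follow; real embeddings have hundreds of dimensions or more.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)


same_direction = cosine_similarity([3, 4], [6, 8])    # 1.0
unrelated = cosine_similarity([1, 0], [0, 1])         # 0.0
opposite = cosine_similarity([1, 0], [-1, 0])         # -1.0
print(same_direction, unrelated, opposite)
```

Note that [6, 8] is just [3, 4] doubled, and the score is still 1. Cosine similarity only cares about direction, not length.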

Try it. Pick a question below and watch how the same set of document chunks gets scored differently depending on what you asked.

Illustrative scores, not a live model, but the behaviour is exactly right: the chunk that means the same thing rises to the top, even when it shares no words with the question. That is the magic keyword search cannot do.

Play with it for a second. Ask “how do I get my money back” and the refund line wins, even though they share no words. Ask “how do I stop being billed” and suddenly the cancellation line jumps ahead of the refund line, because cancelling is closer in meaning to your intent. The system is not matching letters. It is matching meaning. That is the quiet superpower underneath every RAG system.
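
Here is the same behaviour as a small sketch you can run. Like the scores above, the numbers are illustrative: the three-number vectors are written by hand and do not come from a real embedding model, and the chunk titles are invented for the example. Each axis loosely stands for one idea (getting money back, stopping a service, delivery).

```python
import math

# Hand-written stand-ins for real embeddings (assumption, not model output).
# Axes, loosely: [getting money back, stopping a service, delivery].
CHUNKS = {
    "Refund procedure": [0.9, 0.2, 0.0],
    "Cancelling your subscription": [0.2, 0.9, 0.0],
    "Shipping times": [0.0, 0.1, 0.9],
}
QUESTIONS = {
    "how do I get my money back": [0.8, 0.3, 0.1],
    "how do I stop being billed": [0.3, 0.9, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank(question):
    """Return the chunk titles ordered from most to least similar."""
    q = QUESTIONS[question]
    return sorted(CHUNKS, key=lambda title: cosine(q, CHUNKS[title]), reverse=True)


for question in QUESTIONS:
    print(question, "->", rank(question)[0])
```

With these made-up vectors, the money-back question ranks the refund chunk first and the billing question ranks the cancellation chunk first, with no keyword matching involved anywhere.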

So, the whole picture

Let me put Part 1 together in plain terms.

A language model is brilliant but blind to your private and recent information. RAG fixes that not by teaching the model new facts, but by fetching the right facts at question time and handing them over. To fetch the right facts, it turns every piece of text into a point on a map of meaning, then grabs the points nearest your question. The model reads those, and answers grounded in real, checkable sources.

That is RAG at its core, and honestly, if you only ever understood this much, you would understand more than most people shipping these systems.

But here is the catch, and it is the thing that separates a toy demo from something that actually works: what you retrieve is only as good as how you cut up your documents in the first place. If your chunks are bad, your retrieval is bad, and no clever model can save you.

That unglamorous, make-or-break decision is called chunking, and it is the entire subject of Part 2. It is the part most people get wrong, and the part where a little understanding pays off enormously.

See you there.
