RAG, Part 2: Chunking, the Decision That Quietly Decides Everything

Part 2 of a 3-part series on RAG. Part 1 covered what RAG is and how retrieval finds meaning. This part is about chunking. Part 3 is about making it good.

In Part 1, I made retrieval sound clean. Turn your documents into points on a map of meaning, find the points nearest the question, hand them to the model. Simple.

I left out the messy step that happens first, and it is the step that most quietly decides whether your whole system works.

Before any of that map-of-meaning magic, you have to take your documents and cut them into pieces. You cannot embed a 300-page manual as one giant blob. So you slice it up, and each slice becomes one searchable chunk. The question is: where do you cut?

It sounds trivial. It is not. Cut in the wrong places and you will retrieve garbage no matter how good your model is. This is the part people skip past in tutorials, and then wonder why their RAG demo gives confident, useless answers. So let us actually look at it.

Why you cannot just split every 500 characters

The lazy approach is to chop the text every N characters and move on. Fast, simple, and it will hurt you. Here is why.

Imagine your document has this sentence: “Refunds are processed within 30 days. To start one, email support.” If your blind cut lands right in the middle, you get one chunk ending at “within 30” and the next starting at “days. To start one.” Now someone asks how long refunds take, and the retriever finds a fragment that says “Refunds are processed within 30” and stops. The actual answer got sliced in half and scattered across two chunks. Neither chunk is complete. Both are slightly wrong.
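To make that concrete, here is a minimal sketch of the blind cut. The function name fixed_size_chunks and the size of 31 characters are my own choices for illustration, picked so the cut lands exactly where the example above says it does.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, ignoring sentence boundaries."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "Refunds are processed within 30 days. To start one, email support."

# size=31 is an illustrative value chosen so the cut falls right after "30".
chunks = fixed_size_chunks(text, size=31)
for chunk in chunks:
    print(repr(chunk))
# 'Refunds are processed within 30'
# ' days. To start one, email supp'
# 'ort.'
```

Nothing is lost overall, since joining the chunks gives back the original text. The damage is that no single chunk carries the full answer.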

A good chunk is like a good paragraph: it holds one idea, completely, with enough around it to make sense on its own. The art of chunking is cutting on the natural seams of meaning instead of in the middle of them.

In the real world the documents you chunk are rarely tidy. They are sprawling OpenStack manuals, product FAQs, internal wikis, policy PDFs, the messy stuff people actually need answers from. To keep the demo readable I will use a tiny passage of my own below, but picture it as a few sentences pulled from a page of one of those. Click through the strategies and watch where each one decides to cut.

The same short passage, cut by four different strategies. Alternating shading and outlines mark where one chunk ends and the next begins. Watch how the smarter strategies refuse to cut mid-sentence.

The strategies, in plain terms

Let me name what you just watched.

Fixed-size. Cut every N characters or tokens, full stop. It is the fastest and the dumbest. It happily guillotines sentences. People use it because it is one line of code, and then they are confused when retrieval is poor. It is not useless, but it is rarely the right default.

Recursive. This is the workhorse, and the one I reach for first. Instead of cutting blindly, it tries to split on the biggest natural boundary first: paragraphs. If a piece is still too big, it splits that piece on a smaller boundary, such as line breaks or sentences. Still too big, it splits on words. It keeps the largest units of meaning whole for as long as it can. The popular implementation tries separators in order, roughly “paragraph break, then line break, then space, then last-resort character.” Simple idea, very good results.
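Here is a rough sketch of the recursive idea, not any library's actual code. The name recursive_split is hypothetical. I am assuming character counts stand in for tokens and using the separator order quoted above. Real implementations also handle overlap and merge leftovers more carefully.

```python
def recursive_split(
    text: str,
    max_len: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split on the coarsest separator first; recurse into pieces still too long."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # This piece alone is too big: retry it with finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Refunds are processed within 30 days. To start one, email support.\n\n"
    "Shipping is free on orders over 50 dollars."
)
chunks = recursive_split(doc, max_len=70)
for chunk in chunks:
    print(repr(chunk))
# 'Refunds are processed within 30 days. To start one, email support.'
# 'Shipping is free on orders over 50 dollars.'
```

With max_len=70 both paragraphs fit, so the only cut is the paragraph break. Shrink max_len and the sketch falls back to spaces, which is exactly the point where cuts start landing mid-sentence.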

Sentence-window. A lovely trick. You index single sentences, which makes retrieval pinpoint-accurate, because a one-sentence chunk is unambiguous about what it is. But a lone sentence is thin context for the model. So at answer time, you quietly expand the match to include the sentences around it. You get precise matching and rich context at once.
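A small sketch of the expansion step, under a few assumptions of mine: the sentence splitter is a naive regex, the function names are hypothetical, and the matched index is hard-coded where a real retriever would supply it.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def expand_window(sentences: list[str], hit: int, window: int = 1) -> str:
    """Return the matched sentence plus `window` neighbours on each side."""
    start = max(0, hit - window)
    end = min(len(sentences), hit + window + 1)
    return " ".join(sentences[start:end])


text = (
    "Our store ships worldwide. Refunds are processed within 30 days. "
    "To start one, email support. Gift cards cannot be refunded."
)
sentences = split_sentences(text)

hit = 1  # assumption: pretend the retriever matched sentence index 1
context = expand_window(sentences, hit, window=1)
print(context)
# Our store ships worldwide. Refunds are processed within 30 days. To start one, email support.
```

The index holds single sentences. The model sees the wider window. The window size is the knob to tune.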

Semantic. The clever one. Instead of counting characters, it embeds each sentence and measures how much the meaning shifts from one to the next. Where the topic genuinely changes, it cuts. Where ideas belong together, it keeps them together. It produces the most coherent chunks. It is also the most expensive, because you are running an embedding model across the whole document just to decide where to cut.
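A toy sketch of the cut-where-meaning-shifts idea. Big assumption here: toy_embed is a bag-of-words stand-in I made up so the example runs without a model. A real system would call an actual embedding model at that point. The 0.2 threshold is also arbitrary.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag of lowercase words."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk wherever adjacent sentences fall below `threshold`."""
    if not sentences:
        return []
    chunks: list[str] = []
    current = [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(nxt)) < threshold:
            chunks.append(" ".join(current))  # topic shifted: close the chunk
            current = []
        current.append(nxt)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Refunds are processed within 30 days.",
    "Refunds are started by email.",
    "Shipping is free over 50 dollars.",
    "Shipping is slower on holidays.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
# Refunds are processed within 30 days. Refunds are started by email.
# Shipping is free over 50 dollars. Shipping is slower on holidays.
```

Notice the cost even in the toy: every sentence gets embedded before a single cut is made.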

The unsung hero: overlap

There is one more idea that sits on top of all of these, and it fixes the sliced-sentence problem almost for free.

Overlap. When you cut chunk two, you let it start a little before chunk one ended. The pieces share a small overlapping band. So if an idea straddles a boundary, it survives intact in at least one chunk instead of being orphaned across two.

Chunk one: Refunds are processed within 30 days. To start

Chunk two: 30 days. To start one, email support.

The shared phrase “30 days. To start” is the overlap, present in both chunks. The words “30 days”, which the blind cut split in half, now appear whole in both pieces, and the full refund sentence survives intact in chunk one. A small, cheap insurance policy against bad cuts.

A common setting is a chunk of a few hundred tokens with an overlap of ten to twenty percent. It costs you a little storage and a little redundancy, and it saves you from a whole category of silent failures. Almost always worth it.
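Here is a sketch of overlap as a sliding window. I am assuming words stand in for tokens, and the function name is my own. The sizes (8 words, 4 shared) are far bigger than the ten to twenty percent suggested above. I picked them only because they reproduce the two refund chunks from the example.

```python
def chunks_with_overlap(words: list[str], size: int, overlap: int) -> list[str]:
    """Slide a window of `size` words, stepping forward by `size - overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


text = "Refunds are processed within 30 days. To start one, email support."

# Illustrative values: 8-word chunks sharing 4 words with their neighbour.
chunks = chunks_with_overlap(text.split(), size=8, overlap=4)
for chunk in chunks:
    print(chunk)
# Refunds are processed within 30 days. To start
# 30 days. To start one, email support.
```

Set overlap to 0 and the same function becomes a plain fixed-size chunker, which shows how cheap the upgrade is.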

Now the honest part: which one actually wins?

Here is where I want to be careful, because this is exactly the kind of thing people repeat without checking. The numbers below are from Chroma’s published chunking evaluation, not my guesses, and the source is credited at the end of this post.

Their team tested many strategies on the same benchmark and measured recall, which is roughly “of all the chunks that should have been retrieved, what fraction did we actually get.” Higher is better. A few real results:

Strategy	Chunk setup	Recall	Cost
Recursive	400 / 200 overlap	88.3%	cheap
Recursive	200 / no overlap	88.5%	cheap
LLM semantic	~240 tokens	91.7%	very slow
Cluster semantic	~400 setting	92.1%	expensive

Selected results from Chroma's chunking evaluation (text-embedding-3-large, top 5 retrieved). The semantic methods do win on recall, by a few points. But read the cost column before you celebrate.
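If recall feels abstract, here is the rough definition from above as a few lines of code. The chunk IDs are made up, and this counts whole chunks to match my plain-English description. The published evaluation may well compute it at a finer granularity, so treat this as the idea rather than their exact method.

```python
def recall(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of the relevant items that actually got retrieved."""
    if not relevant:
        raise ValueError("recall is undefined when nothing is relevant")
    return len(retrieved & relevant) / len(relevant)


# Hypothetical chunk IDs, purely for illustration.
relevant = {"c1", "c4", "c7", "c9"}          # what should have come back
retrieved = {"c1", "c2", "c4", "c7", "c8"}   # what the top-5 search returned

score = recall(retrieved, relevant)
print(score)  # 0.75, three of the four relevant chunks were found
```

Note what recall ignores: the two irrelevant chunks that came back cost nothing here. That is a separate measurement.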

So semantic chunking is the best, right? Pack it up, go home?

Not so fast, and this is the lesson I most want you to take from Part 2. The semantic methods won by a few points of recall. But the Chroma team noted the LLM-based chunker took tens of minutes to run, which makes it impractical for most real document collections. Meanwhile plain recursive chunking, the cheap and simple one, landed around 88 percent, just a few points behind, in a tiny fraction of the time and cost.

Read that gap honestly. You are paying a large amount of extra compute to move from 88 to 92. Sometimes those four points matter enormously, in medicine, in law, where a missed chunk is a real harm. Often they do not, and the simple thing is the right thing.

What I actually do

After all the theory, here is the boring, effective advice I would give a friend starting out:

Start with recursive chunking, a few hundred tokens, ten to twenty percent overlap. It is cheap, simple, and gets you most of the way.
If you find retrieval is missing context that sits right next to a good match, add a sentence-window layer so the model sees the neighbourhood.
Only reach for semantic chunking when you have measured a real problem that the simpler methods cannot fix, and you can afford the cost. Do not start there.
Whatever you pick, measure it on your own documents. The benchmark winner on someone else’s data is not guaranteed to win on yours. Your handbook is not their dataset.

That last point matters more than any strategy name. The teams that win at RAG are not the ones with the fanciest chunker. They are the ones who actually looked at what their system retrieved, saw it was bad, and fixed the cut.
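Measuring does not need to be fancy. Below is a minimal sketch of a check you could run on your own documents. Everything in it is an assumption for illustration: the retriever is a toy that counts shared words instead of using embeddings, the names are hypothetical, and the single test case is the refund sentence from earlier.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank chunks by how many question words they share."""
    q = words(question)
    return sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)[:k]


def hit_rate(chunks: list[str], cases: list[tuple[str, str]], k: int = 1) -> float:
    """Fraction of questions whose expected answer text is in a top-k chunk."""
    if not cases:
        raise ValueError("need at least one test case")
    hits = sum(
        any(answer in chunk for chunk in retrieve(question, chunks, k))
        for question, answer in cases
    )
    return hits / len(cases)


doc = "Refunds are processed within 30 days. To start one, email support."
cases = [("How long are refunds processed within?", "within 30 days")]

blind = [doc[i:i + 31] for i in range(0, len(doc), 31)]  # fixed-size cut
by_sentence = re.split(r"(?<=[.!?])\s+", doc)            # cut on sentence ends

print(hit_rate(blind, cases))        # 0.0
print(hit_rate(by_sentence, cases))  # 1.0
```

Swap in your real chunkers, your real retriever, and twenty questions you already know the answers to. That small table of numbers will tell you more than any benchmark.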

Chunking is unglamorous. It is also the difference between a demo and a product. Get this right and you have done the hard, quiet work that most people skip.

In Part 3, we leave chunking behind and tackle retrieval itself: why pure meaning-search misses exact terms, how keyword search and vector search work better together, what reranking is, and the strange ways RAG still fails even when everything looks right. That is where it gets genuinely fun.

Benchmark figures in this post are from Chroma’s “Evaluating Chunking Strategies for Retrieval” research. The explanations and example are my own.