Context Engineering: Why AI Forgets, and How We Fight It
Talk to an AI assistant long enough and you will feel it happen. Something you told it near the start (your name, a constraint, a decision you both agreed on) quietly stops being true for it. It contradicts itself. It asks for something you already gave it. It forgets.
This is not the model being dumb. It is the model hitting a wall that every one of them has: a fixed-size working memory with a hard edge. Everything the model can “see” at once (your message, its own past replies, any documents, the instructions) has to fit inside one bounded space called the context window. Fill it up, and something has to go. And here is the twist: the model can start failing before the window is full, because it does not read the whole window evenly.
The craft of managing that bounded space, deciding on every single call what the model gets to see so that the right things are present and the noise is gone, is called context engineering. It has quietly become one of the most important skills in building with AI, to the point that some practitioners say it has replaced prompt engineering. Let me show you why forgetting happens, the study that demonstrated the sneaky part, and the concrete moves teams use to fight it, with both everyday and developer examples.
The context window: a desk, not a filing cabinet
Picture the model’s memory not as a filing cabinet that stores everything, but as a desk of a fixed size. Whatever it’s working on has to be laid out on that desk right now. The system instructions, your question, the conversation so far, any files it’s referencing, all of it competes for the same finite surface. There is no “and also remember this from an hour ago” drawer. If it’s not on the desk, the model cannot see it. Full stop.
So what happens when the desk fills? In a typical chat application, the oldest papers slide off the far edge to make room. That is the plain mechanical reason an assistant forgets what you said early in a long chat: those tokens were dropped from the window to fit the newer ones.
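Here is a minimal sketch of that sliding-off behaviour, assuming the application trims the history itself before each call. The names `count_tokens` and `fit_to_window`, the message format, and the word count standing in for a real tokenizer are illustrative assumptions, not any particular product's implementation.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def fit_to_window(system: str, messages: list[str], budget: int) -> list[str]:
    """Keep the system prompt, then as many of the newest messages as fit."""
    remaining = budget - count_tokens(system)
    kept: list[str] = []
    # Walk from newest to oldest; stop at the first message that does not fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    # Restore chronological order before returning.
    return [system] + kept[::-1]


history = [
    "My name is Dana and my budget is 50 dollars",  # the early constraint
    "Here is my first question about parsing",
    "Thanks, now a second question about logging",
    "And a third question about packaging",
]
window = fit_to_window("You are a helpful assistant", history, budget=20)
print(window)
```

With this toy budget, the two newest messages survive and the opening constraint is gone. Nothing in the loop asks whether the dropped message mattered, which is exactly the problem.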
You might think: fine, just buy a bigger desk. Models have grown huge windows, hundreds of thousands of tokens, even millions. Problem solved? Not quite, and this is where it gets interesting.
The sneaky part: lost in the middle
In 2023, a team of researchers (Nelson Liu and colleagues, from Stanford, Berkeley, and Samaya AI) ran a careful study with a title that says it all: “Lost in the Middle.” They gave models a long context containing the one document that held the answer, and they slid that document around, sometimes near the front, sometimes buried in the middle, sometimes near the end, and measured how often the model found it.
The result is one of those findings you don’t forget. Performance traced a U-shape. Models were sharp when the key fact sat near the beginning or the end of the context, and noticeably worse when it sat in the middle, even though the whole thing fit comfortably in the window. A bigger desk doesn’t help if the model skims the middle of the page.
Put those two facts together and you have the whole motivation for context engineering:
- The window is finite, so you cannot just pour everything in.
- Even what fits is not read evenly, so more is often worse: extra noise in the middle actively hurts.
The goal, then, is not “give the model everything.” It is “give the model the right things, in the right places, and nothing else.” Curate the desk.
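One common way to act on the “right places” part is to reorder retrieved material so the strongest items sit at the edges of the prompt. The sketch below assumes the input is already ranked most relevant first; `order_for_edges` is a hypothetical helper, and the alternating scheme is one heuristic among several, not something the study prescribes.

```python
def order_for_edges(ranked: list[str]) -> list[str]:
    """Place the most relevant items at the start and end, weakest in the middle.

    `ranked` is assumed to be sorted most relevant first.
    """
    front: list[str] = []
    back: list[str] = []
    for i, item in enumerate(ranked):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(item)
    # Reverse the back half so the second-best item ends up last.
    return front + back[::-1]


docs = ["A (best)", "B", "C", "D", "E (weakest)"]
print(order_for_edges(docs))
```

The best document opens the context, the second best closes it, and the weakest lands in the middle, where a missed read costs the least.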
What goes wrong if you don’t
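Before the fixes, feel the failure modes. Naively stuffing the window, or naively letting it overflow, produces predictable pain. Let it overflow and early facts drop out, so the assistant contradicts itself or asks again for something you already gave it. Stuff it and the key fact gets buried in the middle, where the model reads worst, while every call gets slower and more expensive.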
The techniques: how we fight forgetting
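Here are the real moves: compaction, retrieval, pruning, and ordering, backed by an external store for anything that has to outlive the session. None is exotic; together they’re the toolkit. Each has a plain-language angle and a developer angle.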
Two of these deserve a real example, one general and one for the developers, because that’s where it stops being theory.
The general example, a long support chat. You’ve been troubleshooting your internet for forty messages. A well-engineered assistant doesn’t keep all forty on the desk. It quietly compacts: “User’s router is model X, already tried restarting and cable-swapping, both failed, ISP confirmed no outage.” Three lines now stand in for forty turns. The desk has room, the key facts survive, and it stops asking you to restart the router for the third time.
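The support-chat compaction can be sketched roughly as follows. This assumes the summary is produced by some function you supply; `compact` and `toy_summarize` are hypothetical names, and the toy summarizer simply keeps turns tagged `FACT:` so the example runs without a model. A real system would ask a model to write the summary.

```python
from typing import Callable


def compact(history: list[str], keep_recent: int,
            summarize: Callable[[list[str]], str]) -> list[str]:
    """Replace all but the last `keep_recent` turns with one summary turn."""
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    if not older:
        return list(history)  # nothing old enough to compact
    return ["[Summary of earlier turns] " + summarize(older)] + recent


def toy_summarize(turns: list[str]) -> str:
    # Hypothetical stand-in for a model call: keep only turns marked as facts.
    prefix = "FACT: "
    return "; ".join(t[len(prefix):] for t in turns if t.startswith(prefix))


chat = [
    "FACT: router is model X",
    "try restarting it",
    "FACT: restart failed",
    "try swapping the cable",
    "FACT: cable swap failed",
    "FACT: ISP confirmed no outage",
    "what should I try next?",
]
compacted = compact(chat, keep_recent=1, summarize=toy_summarize)
print(compacted)
```

Seven turns become two: one summary line carrying the facts, plus the newest message verbatim.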
The developer example, a coding agent on a big repo. The agent cannot fit your whole codebase on the desk, not even close. So it doesn’t try. It retrieves: it embeds your files and, when a task touches authentication, pulls in just auth.ts and its tests, leaving the other 900 files out on the shelf at zero token cost. When it finishes a sub-task, it prunes the noisy tool output and keeps only the outcome (“tests pass”). And it orders the prompt so the task and the key file sit at the edges, not lost among boilerplate. Same finite desk, used with discipline.
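A toy version of the retrieve-and-prune steps might look like this. To stay runnable without a model, it swaps real embeddings for plain word-count vectors compared by cosine similarity; the file contents, `embed`, `retrieve`, and `prune` are all made up for illustration, and the pruning rule (keep the last non-empty line) is an assumption about where the outcome sits.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts. A real system would call an
    # embedding model here; this stand-in only keeps the sketch runnable.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(task: str, files: dict[str, str], k: int) -> list[str]:
    """Return the names of the k files most similar to the task description."""
    query = embed(task)
    ranked = sorted(files, key=lambda name: cosine(query, embed(files[name])),
                    reverse=True)
    return ranked[:k]


def prune(tool_output: str) -> str:
    # Keep only the last non-empty line, assumed here to carry the outcome.
    lines = [line for line in tool_output.splitlines() if line.strip()]
    return lines[-1] if lines else ""


repo = {
    "auth.ts": "login token session password verify user authentication",
    "auth.test.ts": "test login token authentication rejects bad password",
    "chart.ts": "render axis bar line chart svg",
    "invoice.ts": "invoice total tax currency rounding",
}
selected = retrieve("fix the authentication login token bug", repo, k=2)
outcome = prune("collecting tests...\nrunning 42 tests\n\ntests pass\n")
print(selected, outcome)
```

Only the two authentication files reach the desk, and the test log shrinks to its final line.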
The production pattern: three tiers of memory
Put it all together and the shape that most serious 2026 systems land on is a three-tier memory, which is really just “the desk, plus two kinds of storage behind it.”
The single hardest judgment in that whole picture is when to compact. Do it too early, mid-derivation, and you throw away detail the model still needs. Do it too late and the desk overflows, and things fall off before you have saved them. Good systems compact at natural seams (when a sub-task has just finished, when the reasoning has clearly converged), not in the middle of a hard step. Compaction is a decision about timing, not just a size threshold.
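One way to encode “timing, not just size” is a two-threshold rule: compact at a seam once the window is reasonably full, and compact regardless only when overflow is imminent. The `State` fields and the 0.6 and 0.9 thresholds below are hypothetical values chosen for illustration; how a system detects a seam is left out and would be its own design problem.

```python
from dataclasses import dataclass


@dataclass
class State:
    tokens_used: int   # tokens currently in the window
    window_size: int   # hard limit of the window
    at_seam: bool      # True right after a sub-task finishes


def should_compact(state: State, soft: float = 0.6, hard: float = 0.9) -> bool:
    """Compact at a seam once past `soft`; compact regardless once past `hard`."""
    fill = state.tokens_used / state.window_size
    if fill >= hard:
        return True  # emergency: overflow is worse than losing some detail
    return state.at_seam and fill >= soft


print(should_compact(State(7_000, 10_000, at_seam=False)))  # mid-step, wait
print(should_compact(State(7_000, 10_000, at_seam=True)))   # seam reached
print(should_compact(State(9_500, 10_000, at_seam=False)))  # nearly full
```

The gap between the two thresholds is the room the system gives itself to wait for a good moment.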
How much this actually saves
The payoff is not subtle. Compacting forty rambling turns into a three-line summary, or retrieving three relevant chunks instead of loading a whole knowledge base, cuts the tokens on the desk dramatically, which means faster replies, lower cost, and better answers, because the noise the model would have skimmed past in the middle is simply gone.
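To put a rough shape on that, here is the arithmetic for the forty-turn chat with made-up numbers. The averages (150 tokens per turn, a 60-token summary, two recent turns kept verbatim) are assumptions for illustration only, not measurements.

```python
def savings(before_tokens: int, after_tokens: int) -> float:
    """Fraction of input tokens removed by compaction."""
    return 1 - after_tokens / before_tokens


turns, tokens_per_turn = 40, 150      # made-up averages
summary_tokens, kept_turns = 60, 2    # made-up summary size, recent turns kept

before = turns * tokens_per_turn                        # full history
after = summary_tokens + kept_turns * tokens_per_turn   # summary + recent turns
print(f"{before} -> {after} tokens, {savings(before, after):.0%} fewer")
```

Under these assumptions the history shrinks from 6000 tokens to 360. If the application resends the history on every call, that reduction applies to each later turn, not just once.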
When to reach for which
A quick map, so you know which tool the situation calls for:
| Situation | Reach for | Why |
|---|---|---|
| Long conversation drifting | Compaction / summarization | Keep decisions, drop the chatter, free the desk |
| Huge knowledge base or codebase | Retrieval (embeddings) | Fetch only the few relevant pieces, leave the rest outside |
| Noisy tool outputs / logs piling up | Pruning | Keep the conclusion, discard the raw transcript |
| One critical instruction must land | Ordering | Put it at the start or end, never the middle |
| Facts must survive across sessions | External store (tier 3) | The desk is wiped each session; durable memory lives outside |
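The last row is the simplest to sketch. Assuming durable facts are small key-value pairs, a JSON file is enough to show the idea: write facts out before the session ends, read them back at the start of the next one. The file name `memory.json` and the helper names are hypothetical, and a production system would more likely use a database or a vector store.

```python
import json
import tempfile
from pathlib import Path


def save_facts(path: Path, facts: dict[str, str]) -> None:
    path.write_text(json.dumps(facts, indent=2), encoding="utf-8")


def load_facts(path: Path) -> dict[str, str]:
    if not path.exists():
        return {}  # first session: nothing remembered yet
    return json.loads(path.read_text(encoding="utf-8"))


def opening_context(facts: dict[str, str]) -> str:
    # Rendered once at session start and placed at the top of the window.
    return "\n".join(f"{key}: {value}" for key, value in sorted(facts.items()))


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "memory.json"
    save_facts(store, {"router_model": "X", "isp_outage": "none confirmed"})
    restored = load_facts(store)  # a later session would start here
    print(opening_context(restored))
```

The desk still starts empty each session; the difference is that the first thing laid on it comes from storage rather than from the user repeating themselves.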
The one idea to carry
An AI model forgets for two honest reasons: its working memory has a hard edge, and even inside that edge it reads the middle poorly. So the job was never to hand it everything. It was to hand it the right things, placed where it reads best, and to keep doing that as the conversation grows: compressing the past, fetching what’s relevant, trimming the noise.
That is context engineering, and once you see it, you understand why the same model can feel brilliant in one product and forgetful in another. The model didn’t change. The care taken with what sits on its desk did. The intelligence was never only in the model. A good share of it lives in the quiet discipline of deciding what it gets to see.