Talk to an AI assistant long enough and you will feel it happen. Something you told it near the start (your name, a constraint, a decision you both agreed on) quietly stops being true for it. It contradicts itself. It asks for a thing you already gave it. It forgets.

This is not the model being dumb. It is the model hitting a wall that every one of them has: a fixed-size working memory with a hard edge. Everything the model can “see” at once (your message, its own past replies, any documents, the instructions) has to fit inside one bounded space called the context window. Fill it up, and something has to go. And here is the twist researchers documented: the model can start failing before the window is full, because it does not read the whole thing evenly.

The craft of managing that bounded space, deciding what the model gets to see on every single call so the right things are present and the noise is gone, is called context engineering. It has quietly become one of the most important skills in building with AI, to the point that some practitioners say it has replaced prompt engineering. Let me show you why forgetting happens, the study that demonstrated the sneaky part, and the concrete moves teams use to fight it, with both everyday and developer examples.

The context window: a desk, not a filing cabinet

Picture the model’s memory not as a filing cabinet that stores everything, but as a desk of a fixed size. Whatever it’s working on has to be laid out on that desk right now. The system instructions, your question, the conversation so far, any files it’s referencing, all of it competes for the same finite surface. There is no “and also remember this from an hour ago” drawer. If it’s not on the desk, the model cannot see it. Full stop.

So what happens when the desk fills? In a typical chat application, the oldest papers slide off the far edge to make room. That is the plain mechanical reason an assistant forgets what you said early in a long chat: those tokens were dropped from the desk to fit the newer ones.

[Figure: a full desk. The system instructions stay pinned. Turn 1 ("my name is Sreenath, I use TypeScript") and turn 2 ("the budget is fixed, don't exceed it") are the oldest items; turn 14 (recent back-and-forth) and turn 15 (your latest message) are the newest. The newest stays; the oldest slides off to make room.]
A context window is a desk of fixed size measured in tokens. When a long conversation overflows it, the earliest turns fall off the edge, which is exactly the stuff you set up at the start ("my name is...", "don't exceed the budget"). The model isn't ignoring you. That information is no longer on the desk.
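The sliding-off behavior can be sketched in a few lines. This is a minimal illustration, not how any particular product does it: `fit_to_window` and `rough_token_count` are hypothetical names, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: one 'token' per whitespace-separated word."""
    return len(text.split())


def fit_to_window(system: str, turns: List[str], budget: int,
                  count: Callable[[str], int] = rough_token_count) -> List[str]:
    """Keep the pinned system prompt plus as many of the newest turns as fit."""
    remaining = budget - count(system)
    kept: List[str] = []
    # Walk from newest to oldest and stop at the first turn that does not fit.
    for turn in reversed(turns):
        cost = count(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    # Restore chronological order behind the pinned instructions.
    return [system] + list(reversed(kept))


system_prompt = "You are a helpful assistant."
turns = [
    "my name is Sreenath, I use TypeScript",
    "the budget is fixed, don't exceed it",
    "can you refactor the parser",
    "sure, here is a first pass",
    "now add tests for it",
]
window = fit_to_window(system_prompt, turns, budget=20)
print(window)
```

With this small budget, only the system prompt and the two newest turns survive. The name and the budget constraint from the first two turns are exactly what gets dropped.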

You might think: fine, just buy a bigger desk. Models have grown huge windows, hundreds of thousands of tokens, even millions. Problem solved? Not quite, and this is where it gets interesting.

The sneaky part: lost in the middle

In 2023, a team of researchers (Nelson Liu and colleagues, from Stanford, Berkeley, and Samaya AI) ran a careful study with a title that says it all: “Lost in the Middle.” They gave models a long context containing the one document that held the answer, and they slid that document around, sometimes near the front, sometimes buried in the middle, sometimes near the end, and measured how often the model found it.

The result is one of those findings you don’t forget. Performance traced a U-shape. Models were sharp when the key fact sat near the beginning or the end of the context, and noticeably worse when it sat in the middle, even though the whole thing fit comfortably in the window. A bigger desk doesn’t help if the model skims the middle of the page.

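[Figure: accuracy (low to high) plotted against the position of the key document in the context (beginning, middle, end). The curve is U-shaped.]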
The "lost in the middle" U-curve, redrawn. Put the crucial fact where the model actually reads best, the start or the end, and it thrives. Bury it dead-center among distractors and accuracy sags, no matter how large the window. Position is not neutral. Where a fact sits changes whether the model uses it.
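To make the setup concrete, here is a rough sketch of how such a position sweep could be assembled. It is not the authors' code: `build_context` and `position_sweep` are hypothetical helpers, and the step that actually queries a model and scores its answer is left out.

```python
from typing import List


def build_context(distractors: List[str], answer_doc: str, position: int) -> List[str]:
    """Return the document list with answer_doc inserted at the given index."""
    if not 0 <= position <= len(distractors):
        raise ValueError("position out of range")
    return distractors[:position] + [answer_doc] + distractors[position:]


def position_sweep(distractors: List[str], answer_doc: str) -> List[List[str]]:
    """Build one context per possible position of the answer document."""
    return [build_context(distractors, answer_doc, i)
            for i in range(len(distractors) + 1)]


distractors = [f"distractor {i}" for i in range(4)]
contexts = position_sweep(distractors, "ANSWER")
# Where the answer document landed in each generated context.
positions = [ctx.index("ANSWER") for ctx in contexts]
print(positions)
```

Each context holds the same documents; only the location of the answer changes. Asking the same question against every context and plotting accuracy by position is what would reveal a U-shape.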

Put those two facts together and you have the whole motivation for context engineering:

  1. The window is finite, so you cannot just pour everything in.
  2. Even what fits is not read evenly, so more is often worse: extra noise in the middle actively hurts.

The goal, then, is not “give the model everything.” It is “give the model the right things, in the right places, and nothing else.” Curate the desk.

What goes wrong if you don’t

Before the fixes, feel the failure modes. Naively stuffing the window, or naively letting it overflow, produces predictable pain:

  1. It forgets the setup. Early constraints scroll off the desk. The assistant cheerfully violates a rule you gave it in turn two.
  2. It drowns in noise. Twenty half-relevant documents get crammed in, the one that mattered sits in the middle, and it gets skimmed right past.
  3. It gets slow and expensive. Every token on the desk is paid for and processed on every single call. A bloated context is a bigger bill and a slower reply, every turn.
  4. It distracts itself. Old, stale, or contradictory turns still on the desk pull the model toward outdated answers. More context can mean worse answers.

The counterintuitive lesson underneath all four: context is not free, and more of it is not better. Every token competes for attention and costs money. The skill is subtraction as much as addition.
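The cost point is easy to check with arithmetic. If the whole history is resent on every call, the total number of input tokens processed grows roughly with the square of the number of turns. The sketch below assumes, purely for illustration, that every turn adds the same number of tokens; `total_input_tokens` and the figures passed to it are hypothetical.

```python
from typing import Optional


def total_input_tokens(tokens_per_turn: int, num_turns: int,
                       cap: Optional[int] = None) -> int:
    """Sum the input tokens sent over a conversation when history is resent each call."""
    total = 0
    for turn in range(1, num_turns + 1):
        context = tokens_per_turn * turn       # whole history up to this turn
        if cap is not None:
            context = min(context, cap)        # engineered: context held under a cap
        total += context
    return total


# Illustrative numbers only: 40 turns of 200 tokens each.
naive = total_input_tokens(tokens_per_turn=200, num_turns=40)
capped = total_input_tokens(tokens_per_turn=200, num_turns=40, cap=2000)
print(naive, capped)
```

Under these assumptions the naive conversation processes 164,000 input tokens in total, while holding the context to 2,000 tokens brings that down to 71,000.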

The techniques: how we fight forgetting

Here are the real moves. None is exotic; together they’re the toolkit. I’ll give each a plain-language and a developer angle.

  1. Compaction / summarization (summarize). When history grows long, replace old turns with a tight summary that keeps the decisions and drops the chatter. The desk clears; the gist stays.
  2. Retrieval (retrieve: bring back only what's needed). Instead of keeping everything on the desk, store it outside and fetch just the few relevant facts when a turn needs them. This is where embeddings earn their keep.
  3. Pruning / trimming (prune). Drop tool outputs, raw logs, and dead turns once they've served their purpose. Keep the conclusion, throw away the transcript that produced it.
  4. Ordering (order: beat the U-curve). Put the most important material at the start and the end, never buried in the middle. Position it where the model actually reads well.

Four everyday moves: shrink it, fetch it, trim it, place it well. Summarization and pruning fight the finite-desk problem; retrieval sidesteps it entirely; ordering fights the lost-in-the-middle problem. Most real systems use all four together.
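Ordering is the easiest of the four to show in code. One simple approach, sketched here under the assumption that you already have the items ranked from most to least important, is to deal them alternately to the front and the back so the weakest material sinks to the middle. `order_for_edges` is a hypothetical helper name.

```python
from typing import List


def order_for_edges(ranked: List[str]) -> List[str]:
    """Given items sorted most-important-first, put the strongest at the edges
    and let the weakest sink toward the middle."""
    front: List[str] = []
    back: List[str] = []
    for i, item in enumerate(ranked):
        if i % 2 == 0:
            front.append(item)   # ranks 1, 3, 5, ... fill in from the start
        else:
            back.append(item)    # ranks 2, 4, 6, ... fill in from the end
    return front + back[::-1]


ordered = order_for_edges(["A", "B", "C", "D", "E"])
print(ordered)  # ['A', 'C', 'E', 'D', 'B']
```

Here "A" and "B", the two most important items, end up first and last, while the least important items sit in the middle where reading is weakest.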

Two of these deserve a real example, one general and one for the developers, because that’s where it stops being theory.

The general example: a long support chat. You’ve been troubleshooting your internet for forty messages. A well-engineered assistant doesn’t keep all forty on the desk. It quietly compacts: “User’s router is model X, already tried restarting and cable-swapping, both failed, ISP confirmed no outage.” Three lines now stand in for forty turns. The desk has room, the key facts survive, and it stops asking you to restart the router for the third time.
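In code, compaction of this kind might look like the sketch below. The summarizer is passed in as a function because producing a good summary is the hard part; `placeholder_summarizer` is a deliberately trivial stand-in for what would probably be a model call in a real system, and both function names are hypothetical.

```python
from typing import Callable, List


def compact(turns: List[str], keep_recent: int,
            summarize: Callable[[List[str]], str]) -> List[str]:
    """Replace all but the last keep_recent turns with a single summary entry."""
    if len(turns) <= keep_recent:
        return list(turns)               # nothing old enough to compact
    cut = len(turns) - keep_recent
    old, recent = turns[:cut], turns[cut:]
    return ["[summary] " + summarize(old)] + recent


def placeholder_summarizer(old_turns: List[str]) -> str:
    # Stand-in only: a real system would likely ask a model for the summary.
    return f"{len(old_turns)} earlier turns condensed"


history = [f"turn {i}" for i in range(1, 41)]
compacted = compact(history, keep_recent=4, summarize=placeholder_summarizer)
print(len(history), "->", len(compacted))  # 40 -> 5
```

The most recent turns stay verbatim, since those are the ones the model is most likely to need word for word.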

The developer example: a coding agent on a big repo. The agent cannot fit your whole codebase on the desk, not even close. So it doesn’t try. It retrieves: it embeds your files, and when a task touches authentication, it pulls in just auth.ts and its tests, leaving the other 900 files on the shelf, where they cost no tokens. When it finishes a sub-task, it prunes the noisy tool output and keeps only the outcome (“tests pass”). And it orders the prompt so the task and the key file sit at the edges, not lost among boilerplate. Same finite desk, used with discipline.
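The retrieval step can be sketched without any external library. The `embed` function below is a toy bag-of-words vector, used only so the example runs on its own; a real system would use a learned embedding model and probably a vector store. The file names and their contents are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, files: Dict[str, str], k: int = 2) -> List[str]:
    """Return the names of the k files most similar to the query."""
    q = embed(query)
    ranked = sorted(files, key=lambda name: cosine(q, embed(files[name])),
                    reverse=True)
    return ranked[:k]


# Hypothetical one-line descriptions standing in for file contents.
files = {
    "auth.ts": "login session token password authentication middleware",
    "auth.test.ts": "tests for login and authentication token expiry",
    "chart.ts": "render bar chart axis colors",
    "invoice.ts": "compute invoice totals and tax",
}
top = retrieve("fix the authentication token bug", files, k=2)
print(top)  # ['auth.ts', 'auth.test.ts']
```

Only the two retrieved files would be placed on the desk; the others never enter the prompt at all.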

The production pattern: three tiers of memory

Put it all together and the shape that many serious systems land on in 2026 is a three-tier memory, which is really just “the desk, plus two kinds of storage behind it.”

  1. Working memory (on the desk). The live context window: the current turn, full and lossless. Fast, precise, and small. This is the only thing the model can directly see.
  2. Compressed session memory. As the session grows, older stretches get summarized into compact form. Continuity within the session, without keeping every raw turn on the desk.
  3. External persistent store. Cross-session, long-term knowledge saved outside entirely (a vector store, a database). Retrieved back onto the desk only when relevant, at the start of a session or mid-task.

Tier 1 is the desk. Tier 2 is the notepad of summaries beside it. Tier 3 is the filing cabinet in the other room. The art is moving information between them at the right moments: compress before the desk overflows, save the durable facts before they're lost, and fetch them back only when they matter.
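As a data structure, the three tiers could be sketched roughly like this. Everything here is a simplifying assumption: `TieredMemory` is a hypothetical class, a plain dictionary stands in for the external store, and the "summary" is a placeholder string rather than a real summary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TieredMemory:
    working: List[str] = field(default_factory=list)     # tier 1: raw turns on the desk
    summaries: List[str] = field(default_factory=list)   # tier 2: compressed session notes
    store: Dict[str, str] = field(default_factory=dict)  # tier 3: stand-in for a database
    max_working: int = 4

    def add_turn(self, turn: str) -> None:
        """Put a new turn on the desk, compressing old ones if it overflows."""
        self.working.append(turn)
        if len(self.working) > self.max_working:
            self._compress_oldest()

    def _compress_oldest(self) -> None:
        # Move the older half of the desk into a one-line note (placeholder summary).
        half = len(self.working) // 2
        old, self.working = self.working[:half], self.working[half:]
        self.summaries.append(f"summary of {len(old)} turns")

    def remember(self, key: str, fact: str) -> None:
        """Save a durable fact outside the window."""
        self.store[key] = fact

    def build_context(self, recall_keys: List[str]) -> List[str]:
        """Assemble what the model sees: recalled facts, then notes, then raw turns."""
        recalled = [self.store[k] for k in recall_keys if k in self.store]
        return recalled + self.summaries + self.working


mem = TieredMemory()
mem.remember("name", "user is Sreenath")
for i in range(1, 7):
    mem.add_turn(f"turn {i}")
context = mem.build_context(["name"])
print(context)
```

After six turns, the first two have been folded into a summary, the last four remain verbatim, and the user's name comes back from the external store only because it was asked for.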

The single hardest judgment in that whole picture is when to compact. Do it too early, mid-derivation, and you throw away detail the model still needs. Do it too late and the desk overflows and things fall off before you saved them. Good systems compact at natural seams, when a sub-task just finished, when the reasoning has clearly converged, not in the middle of a hard step. Compaction is a decision about timing, not just a size threshold.
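One way to encode that timing rule is a soft threshold that only fires at a seam, plus a hard threshold that fires regardless. The function name and both threshold values below are assumptions chosen for illustration, not recommended settings.

```python
def should_compact(used_tokens: int, window: int, at_seam: bool,
                   soft: float = 0.6, hard: float = 0.9) -> bool:
    """Decide whether to compact now.

    soft: fill level above which compaction is welcome, but only at a natural seam.
    hard: fill level above which compaction happens even mid-step, to avoid overflow.
    """
    fill = used_tokens / window
    if fill >= hard:
        return True                      # emergency: better a rough cut than an overflow
    return at_seam and fill >= soft


print(should_compact(7000, 10000, at_seam=False))  # False: mid-step, still has room
print(should_compact(7000, 10000, at_seam=True))   # True: sub-task just finished
print(should_compact(9500, 10000, at_seam=False))  # True: about to overflow
```

How the caller detects a seam (a finished sub-task, a converged line of reasoning) is the real judgment call, and it is left outside this sketch.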

How much this actually saves

The payoff is not subtle. Compacting forty rambling turns into a three-line summary, or retrieving three relevant chunks instead of loading a whole knowledge base, cuts the tokens on the desk dramatically. That means faster replies, lower cost, and often better answers, because the noise the model would have skimmed past in the middle is simply gone.

[Figure: tokens on the desk, two bars. Naive (whole history + every doc on the desk): slow, costly, noisy. Engineered (summary + retrieved facts only): lean, cheap, sharp.]
Tokens sitting on the desk, before and after context engineering, for the same task. The engineered version isn't just cheaper and faster. It's often more accurate, because it removed the middle-of-the-context noise the model would have gotten lost in. Less, placed well, beats more.

When to reach for which

A quick map, so you know which tool the situation calls for:

| Situation | Reach for | Why |
| --- | --- | --- |
| Long conversation drifting | Compaction / summarization | Keep decisions, drop the chatter, free the desk |
| Huge knowledge base or codebase | Retrieval (embeddings) | Fetch only the few relevant pieces, leave the rest outside |
| Noisy tool outputs / logs piling up | Pruning | Keep the conclusion, discard the raw transcript |
| One critical instruction must land | Ordering | Put it at the start or end, never the middle |
| Facts must survive across sessions | External store (tier 3) | The desk is wiped each session; durable memory lives outside |

Most real systems combine several of these. A coding agent retrieves the right files, prunes stale output, compacts long sessions, and orders the prompt carefully, all at once. Context engineering is choosing the right mix for the moment.
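Of the moves in that mix, pruning is perhaps the simplest to sketch. The version below assumes each tool call was logged with a short recorded outcome alongside its raw output; the `Entry` type and `prune_tool_output` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    role: str            # "user", "assistant", or "tool"
    content: str
    outcome: str = ""    # short conclusion recorded for a tool call, if any


def prune_tool_output(entries: List[Entry], keep_last_raw: int = 1) -> List[Entry]:
    """Replace raw tool output with its recorded outcome, except for the newest calls."""
    tool_positions = [i for i, e in enumerate(entries) if e.role == "tool"]
    keep_raw = set(tool_positions[-keep_last_raw:]) if keep_last_raw > 0 else set()
    pruned: List[Entry] = []
    for i, e in enumerate(entries):
        if e.role == "tool" and i not in keep_raw and e.outcome:
            # Keep the conclusion, drop the transcript that produced it.
            pruned.append(Entry("tool", e.outcome, e.outcome))
        else:
            pruned.append(e)
    return pruned


log = [
    Entry("user", "run the tests"),
    Entry("tool", "...hundreds of lines of test runner output...", outcome="tests pass"),
    Entry("assistant", "All green."),
    Entry("tool", "raw lint log", outcome="2 lint warnings"),
]
pruned = prune_tool_output(log)
print([e.content for e in pruned])
```

The older tool call shrinks to "tests pass", while the most recent one stays raw in case the model still needs its details.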

The one idea to carry

An AI model forgets for two honest reasons: its working memory has a hard edge, and even inside that edge it reads the middle poorly. So the job was never to hand it everything. It was to hand it the right things, placed where it reads best, and to keep doing that as the conversation grows: compressing the past, fetching what’s relevant, trimming the noise.

That is context engineering, and once you see it, you understand why the same model can feel brilliant in one product and forgetful in another. The model didn’t change. The care taken with what sits on its desk did. The intelligence was never only in the model. A good share of it lives in the quiet discipline of deciding what it gets to see.
