Learn why reasoning reuse beats bigger context for multi-turn agents, and how preserve_thinking reduces drift, cost, and repeated mistakes.
Most multi-turn agents do not fail because they lack a giant context window. They fail because they keep forgetting what already mattered.
Bigger context is not enough because multi-turn agents do not just need more room. They need a stable, task-aligned working state that can survive long interactions without drowning in stale details, misplaced assumptions, or irrelevant tokens.[1][2]
Here's the core mistake I keep seeing: people assume a 200k or 1M token window automatically solves agent memory. It doesn't. A bigger window gives you storage capacity. It does not guarantee retrieval quality, attention quality, or clean execution. The SoK on Agentic RAG says this directly: long-context models still need structured context selection because performance degrades depending on where relevant information appears in long inputs.[2]
That point matters more than it sounds. In a real agent loop, every turn adds more observations, tool outputs, plans, guesses, and partial conclusions. If you just keep appending that stream, the agent eventually reasons over clutter. ARC calls this context rot: the agent's internal state becomes less coherent and less aligned as the task stretches on.[1]
So the real bottleneck is not token capacity. It's state quality.
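To put rough numbers on the append-only loop described above, here's a back-of-the-envelope sketch. It compares the total input tokens an agent reads when every turn re-reads the whole history against a loop that reads a bounded state plus only the newest turn. The function names, the fixed tokens-per-turn figure, and the state budget are all illustrative assumptions, not measurements from any of the cited papers.

```python
def append_only_cost(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens read across all turns when each turn re-reads the full history."""
    # Turn k reads k * tokens_per_turn tokens, so the total is an arithmetic series.
    return tokens_per_turn * turns * (turns + 1) // 2


def compact_state_cost(turns: int, tokens_per_turn: int, state_budget: int) -> int:
    """Total input tokens when each turn reads a bounded state plus only the newest turn."""
    return turns * (state_budget + tokens_per_turn)


# Assumed workload: 40 turns, 1,500 new tokens per turn, a 2,000-token running state.
replay_total = append_only_cost(turns=40, tokens_per_turn=1_500)
compact_total = compact_state_cost(turns=40, tokens_per_turn=1_500, state_budget=2_000)
print(replay_total)   # 1230000
print(compact_total)  # 140000
```

The gap is quadratic versus linear growth. It says nothing about quality by itself, but it shows why the clutter compounds: the later the turn, the more old material the model has to read past.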
In this article, preserve_thinking means carrying forward the useful reasoning state from earlier turns in a compact, reusable form instead of forcing the model to reconstruct it from raw history every time.
I'd define it as preserving the conclusions, priorities, and decision context that still matter now. Not every chain-of-thought token. Not every dead-end search. Not every discarded hypothesis. Just the parts that should keep shaping the next move.
That distinction lines up with recent long-horizon agent work. ARC separates action execution from context management and maintains an interaction memory plus a checklist that can be revised over time.[1] UI-Copilot goes even more concrete: it keeps only concise progress summaries in the live dialogue while storing detailed reasoning externally for retrieval on demand.[3]
That is basically the preserve_thinking idea in architecture form. The agent does not need to reread every old thought. It needs access to the right distilled thought.
This is also where reasoning reuse becomes more interesting than plain memory. Memory says, "store what happened." Reasoning reuse says, "store what was learned and make it usable again."
Reasoning reuse beats transcript replay because replay preserves volume, while reuse preserves signal. Multi-turn agents need the second one far more.
Transcript replay looks safe. In theory, if you keep the full trace, nothing is lost. In practice, everything important gets buried. ARC shows that raw accumulation leads to attention dilution, while passive summarization alone still lets early mistakes persist.[1] UI-Copilot reports similar failure modes in GUI agents: memory degradation, progress confusion, and math hallucination when too much reasoning is mixed into the active context.[3]
What works better is a split model:
| Approach | What it keeps live | Main failure mode | Better use case |
|---|---|---|---|
| Raw transcript replay | Everything | Attention dilution, drift | Short tasks |
| Bigger context only | More of everything | Relevance collapse | Broad document intake |
| Summary plus retrieval | Progress + on-demand details | Summary quality risk | Long multi-turn tasks |
| Reasoning reuse | Distilled conclusions and strategies | Requires memory design | Persistent agents |
This is why I think preserve_thinking is underrated. It's not about preserving every thought. It's about preserving the right cognitive residue.
A good agent should be able to say: "We already established X. Y was a dead end. Z is still unresolved. Continue from there."
That is much closer to how competent humans work too.
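Here's a minimal sketch of what that "cognitive residue" could look like as a data structure. The `ReasoningState` class, its field names, and the example values are hypothetical and chosen only for illustration; none of the cited systems is confirmed to use this shape.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ReasoningState:
    """Hypothetical container for the reasoning that should survive between turns."""

    goal: str
    confirmed: list[str] = field(default_factory=list)       # what we already established
    dead_ends: list[str] = field(default_factory=list)       # what we tried and ruled out
    open_questions: list[str] = field(default_factory=list)  # what is still unresolved

    def handoff(self) -> str:
        """Render the state as a short note the next turn can start from."""
        lines = [f"Goal: {self.goal}"]
        lines += [f"Established: {item}" for item in self.confirmed]
        lines += [f"Dead end: {item}" for item in self.dead_ends]
        lines += [f"Unresolved: {item}" for item in self.open_questions]
        lines.append("Continue from there.")
        return "\n".join(lines)


# Made-up example values, only to show the rendered note.
state = ReasoningState(
    goal="Explain the failing nightly build",
    confirmed=["The failure started after the dependency bump"],
    dead_ends=["Clearing the build cache did not help"],
    open_questions=["Which transitive package changed its API?"],
)
print(state.handoff())
```

The note stays a handful of lines no matter how many turns produced it, which is the point: the next turn inherits conclusions, not the transcript that led to them.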
Multi-turn agents should preserve reasoning state through compact summaries, explicit task checklists, and retrieval of prior insights, with the ability to revise those artifacts when later evidence proves them wrong.[1][2][3]
Notice the last part: revise. This is the catch.
If you only compress history, you may compress mistakes too. ARC's main contribution is showing that context management should be active and reflection-driven, not just passive summarization.[1] The system updates memory every turn, checks for degradation, and can reorganize the working context when it detects misalignment. That's a lot closer to preserve_thinking than "stuff old messages into a longer prompt."
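A rough sketch of the "revise, don't just compress" idea follows. To be clear, this is not ARC's mechanism: ARC is described as reflection-driven, while this stand-in uses plain rules. `WorkingMemory`, `revise`, and the size-based degradation signal are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class WorkingMemory:
    confirmed: dict[str, str] = field(default_factory=dict)    # claim id -> claim text
    invalidated: dict[str, str] = field(default_factory=dict)  # claim id -> claim plus why it was dropped
    checklist: list[str] = field(default_factory=list)         # remaining subgoals


def revise(
    memory: WorkingMemory,
    contradicted: dict[str, str],
    completed: Iterable[str] = (),
    max_confirmed: int = 20,
) -> bool:
    """Apply one turn's updates and report whether the memory should be reorganized.

    `contradicted` maps claim ids to the evidence that disproved them.
    """
    for claim_id, reason in contradicted.items():
        claim = memory.confirmed.pop(claim_id, None)
        if claim is not None:  # ignore ids that were never confirmed
            memory.invalidated[claim_id] = f"{claim} (dropped: {reason})"

    done = set(completed)
    memory.checklist = [goal for goal in memory.checklist if goal not in done]

    # Crude degradation signal (an assumption): too many live claims suggests bloat.
    return len(memory.confirmed) > max_confirmed
```

The design choice worth copying is that a disproved claim moves to `invalidated` with its reason instead of silently disappearing, so the agent can see that it was wrong and why, rather than rediscovering the same mistake later.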
UI-Copilot reaches a similar result from another angle. It uses a multi-turn summary for active progress tracking while detailed observations are stored separately and retrieved only when needed.[3] That reduces overload and keeps the execution context lighter.
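The split between a light live context and an external detail store can be sketched in a few lines. This is an illustration of the general pattern, not UI-Copilot's implementation: the `SplitMemory` class is hypothetical, and the case-insensitive keyword match is a deliberately naive stand-in for whatever retrieval a real system would use.

```python
from __future__ import annotations


class SplitMemory:
    """Keep one-line progress summaries live; park detailed reasoning elsewhere."""

    def __init__(self) -> None:
        self.summaries: list[str] = []     # goes into the active context
        self.details: dict[int, str] = {}  # stays out of the active context

    def record(self, summary: str, detail: str) -> int:
        """Store one step and return its index."""
        step = len(self.summaries)
        self.summaries.append(summary)
        self.details[step] = detail
        return step

    def live_context(self, last_n: int = 5) -> str:
        """Return only the most recent summaries, labeled with their step index."""
        start = max(0, len(self.summaries) - last_n)
        return "\n".join(
            f"step {i}: {text}"
            for i, text in enumerate(self.summaries[start:], start=start)
        )

    def retrieve(self, keyword: str) -> list[str]:
        """Fetch stored details on demand using a naive keyword match."""
        needle = keyword.lower()
        return [text for text in self.details.values() if needle in text.lower()]
```

In use, the agent prompt would include `live_context()` on every turn and call `retrieve()` only when the current step actually needs an old detail.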
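If you're designing prompts or agent scaffolds, I'd turn that into a simple operating rule: keep a short progress summary live, store the detailed reasoning elsewhere, and revise both when new evidence contradicts them.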
Tools like Rephrase can help you phrase these instructions clearly when you're building prompts for agent frameworks, especially if you want the model to maintain a structured running state instead of dumping verbose thoughts every turn. And if you want more prompting workflows like this, the Rephrase blog is a good rabbit hole.
Reasoning reuse in prompts looks like telling the model to maintain and update reusable internal artifacts, not just continue a chat transcript.
Here's a simple before-and-after.
| Before | After |
|---|---|
| "Continue the task from the previous messages." | "Before acting, update a running state with: current goal, confirmed facts, failed attempts, open questions, and next best action. Reuse prior confirmed conclusions unless contradicted by new evidence." |
And here's a stronger pattern for agent prompts:
Maintain a compact working memory across turns.
At each turn:
1. Update confirmed facts.
2. Mark invalidated assumptions.
3. Keep a short checklist of remaining subgoals.
4. Reuse prior conclusions instead of re-deriving them.
5. Retrieve detailed prior reasoning only if needed to resolve the current step.
Do not copy the full transcript into the active reasoning state.
Prefer concise, revisable summaries over raw history.
That instruction does two things. First, it reduces pointless recomputation. Second, it reduces anchor drift, where the model keeps reinterpreting the problem from scratch.
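If you assemble prompts in code, the pattern above might be wired up roughly like this. The `build_turn_prompt` function and the three state keys are hypothetical names picked for this sketch; the rules text is copied from the prompt above, and no particular agent framework is assumed.

```python
from __future__ import annotations

WORKING_MEMORY_RULES = """
Maintain a compact working memory across turns.
At each turn:
1. Update confirmed facts.
2. Mark invalidated assumptions.
3. Keep a short checklist of remaining subgoals.
4. Reuse prior conclusions instead of re-deriving them.
5. Retrieve detailed prior reasoning only if needed to resolve the current step.
Do not copy the full transcript into the active reasoning state.
Prefer concise, revisable summaries over raw history.
"""

STATE_KEYS = ("confirmed_facts", "invalidated_assumptions", "remaining_subgoals")


def build_turn_prompt(state: dict[str, list[str]], new_observation: str) -> str:
    """Combine the rules, the current running state, and only the newest observation."""
    sections = [WORKING_MEMORY_RULES.strip(), "Current working memory:"]
    for key in STATE_KEYS:
        items = state.get(key, [])
        body = "\n".join(f"- {item}" for item in items) or "- (none)"
        sections.append(f"{key}:\n{body}")
    sections.append(f"New observation:\n{new_observation}")
    return "\n\n".join(sections)
```

Note what is absent: the transcript. Each turn sees the rules, the distilled state, and the latest observation, so the state is the only thing the model has to keep accurate.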
I've noticed this is especially useful for research agents, coding agents, and ops assistants that touch multiple tools over many turns. Bigger context lets them carry more. Reasoning reuse lets them stay coherent.
Reasoning reuse matters more than context length because context length is a capacity upgrade, while reasoning reuse is an architecture upgrade. One gives you more room. The other changes how the room is organized.
The research trend is pretty clear. ARC improves long-horizon performance by actively managing context, not by simply expanding it.[1] The Agentic RAG survey frames memory, pruning, and retrieval as core design choices even in long-context settings.[2] UI-Copilot shows that decoupling progress tracking from detailed reasoning reduces confusion in long tasks.[3]
So my take is simple: if your agent fails after 20 turns, don't assume the fix is 10x more context. The fix is often better preservation of the reasoning state it already produced.
That's the feature preserve_thinking points toward. And yes, it matters more than bigger context for any agent expected to work across sessions, tools, and evolving subtasks.
If you're prompting these systems manually, start there. If you're doing it all day, automate the cleanup and rewriting step with something like Rephrase so your prompts consistently ask for reusable state instead of bloated history.
Documentation & Research
Community Examples
4. Google Cloud AI Research Introduces ReasoningBank: A Memory Framework that Distills Reasoning Strategies from Agent Successes and Failures - MarkTechPost
Reasoning reuse means an agent can preserve useful intermediate conclusions, plans, and lessons from prior turns instead of regenerating them from scratch. Done well, it improves consistency, lowers cost, and reduces repeated mistakes.
In practice, preserve_thinking means keeping a compact, task-relevant representation of the agent's prior reasoning state available across turns. That can include summaries, checklists, retrieved memories, or reusable reasoning strategies rather than full raw transcripts.