How to Keep Context in a Prompt (Without Writing a Novel)
A practical, system-level approach to preserving context: pin what matters, summarize what doesn't, and route memory on purpose.
You can write the cleanest prompt in the world and still get mushy answers after a few turns. The model "forgets," repeats itself, or starts contradicting earlier constraints. People blame the model. I blame the context strategy.
Here's the thing: keeping context isn't about stuffing more text into the prompt. It's about controlling what the model sees, when, and in what form. If you treat the context window like a trash drawer where everything goes "just in case," you'll eventually get noise, conflicts, and drift.
In this post I'll show how I keep context stable over long chats and multi-step tasks by separating "always-true" constraints from working notes, summarizing aggressively, and (when needed) treating memory like an actual subsystem, not a scrolling chat log.
Context is not memory. It's working state.
Most LLM products feel like they "remember" because you can keep chatting. Under the hood, the model only conditions on a limited window of tokens at inference time. When you exceed that, something gets dropped or compressed. So the real question isn't "how do I keep context," it's "which parts of context deserve to survive?"
Research on agent memory systems frames this as a mismatch problem: the granularity you store and retrieve (raw chat logs, chunks, summaries) often doesn't match the granularity a task needs in the moment. That mismatch either creates irrelevant noise (too coarse) or breaks reasoning dependencies (too fragmented). AMA (Adaptive Memory via Multi-Agent Collaboration) is one concrete design that attacks this head-on using multi-granularity memory plus an explicit routing and conflict-checking loop [1]. Even if you're not building an agent, the mental model is gold: store different kinds of context differently, and fetch them intentionally.
Also worth calling out: more context isn't automatically better. Systems that can "optimize context" tend to win by staying relevant and consistent, not by being verbose. Meta Context Engineering (MCE) treats context as something you actively engineer and evolve, balancing brevity versus bloat and optimizing for performance per token [2]. That's basically the grown-up version of prompt hygiene.
So, in practice, I keep context by enforcing three layers: pinned constraints, rolling working memory, and retrieval/summaries.
My three-layer method: pin, prune, and pull
Layer 1 is pinned context: the stuff that should not drift. Your role, your output format, your allowed tools, your definitions, your non-negotiable constraints. It belongs at the top of the prompt every time (or in your system/developer message if you're building an app).
Layer 2 is working context: short-lived notes. The current task step, the draft you're editing, the trade-offs you're exploring right now. This is the layer that you prune ruthlessly.
Layer 3 is pulled context: anything you don't want to keep "hot" in the window. This includes long documents, old decisions, earlier brainstorming, and prior outputs. This layer is either retrieved on demand (search/RAG) or compressed into a summary.
AMA's design is useful because it formalizes these layers as different memory types (raw text vs facts vs episodes) and routes queries to the right store based on intent [1]. MCE adds the meta-lesson: different tasks and models want different context shapes and lengths, so you shouldn't hardcode one forever [2].
The practical takeaway: if you want a model to stay coherent across 30 turns, you must re-state the stable stuff, and compress or externalize the rest.
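The three layers can be sketched as a small assembly step that runs every turn. This is a minimal illustration, not a library API: the names (`build_prompt`, `PINNED`) are hypothetical, and a plain dict stands in for real retrieval.

```python
# Sketch of the three-layer assembly: pinned constraints are re-stated
# verbatim every turn, working notes are pruned to the newest few, and
# bulky material is pulled in (faked here with a dict lookup) only when
# the current step needs it.

PINNED = """You are: a technical editor
Non-negotiables: cite sources; ask before guessing"""

def build_prompt(working_notes: list[str], task: str,
                 pulled_docs: dict[str, str], needed: list[str]) -> str:
    """Re-state pinned context, keep only recent working notes,
    and inline just the retrieved slices the task needs."""
    working = "\n".join(f"- {n}" for n in working_notes[-5:])  # prune: last 5 only
    pulled = "\n".join(pulled_docs[k] for k in needed if k in pulled_docs)
    return (f"<CONTEXT>\n{PINNED}\n</CONTEXT>\n"
            f"<WORKING_STATE>\n{working}\n</WORKING_STATE>\n"
            f"<PULLED>\n{pulled}\n</PULLED>\n"
            f"<TASK>\n{task}\n</TASK>")

prompt = build_prompt(
    working_notes=["step: outline draft", "decision: use markdown"],
    task="Write the intro section.",
    pulled_docs={"style_guide": "Keep sentences under 25 words."},
    needed=["style_guide"],
)
```

The point of the sketch is the shape, not the code: pinned text is a constant, working notes pass through a pruning step, and pulled material only enters the window when explicitly requested.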
Practical prompt patterns that actually keep context stable
I'm going to stay opinionated here: if you don't have an explicit "context section," you're winging it.
A simple (but effective) pattern is to split your input into Context vs Task vs Output Contract. Community users often describe this as "context first, prompt second," and even though the examples are informal, the underlying idea matches what memory papers implement as routing/structuring: separate durable constraints from the immediate instruction [5].
Here's a template I use constantly:
<CONTEXT>
You are: {role}
Audience: {audience}
Goal: {goal}
Non-negotiables: {constraints}
Definitions:
- {term}: {definition}
</CONTEXT>
<WORKING_STATE>
Current step: {step}
Decisions so far (latest wins):
- ...
Open questions:
- ...
</WORKING_STATE>
<TASK>
Do X with Y. If information is missing, ask up to 3 questions.
</TASK>
<OUTPUT_CONTRACT>
Return:
1) ...
2) ...
Format: ...
</OUTPUT_CONTRACT>
Two small details make this work better than it looks. First, the "latest wins" line prevents contradictions from lingering. Second, the explicit "ask up to 3 questions" stops the model from silently guessing when context is thin.
Now the hard part: you must keep <WORKING_STATE> short. When it gets bloated, you don't "add more tokens." You summarize.
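One mechanical way to enforce that discipline is a budget-based pruner: when working notes exceed a rough token budget, fold the oldest ones into a one-line stub and keep only the newest entries verbatim. This is a sketch under stated assumptions; the 4-characters-per-token estimate and the budget value are placeholders, not tuned numbers.

```python
# Sketch of "keep WORKING_STATE short": keep the newest notes that fit
# a rough token budget, and replace everything older with a stub that
# points at the externalized summary.

def prune_working_state(notes: list[str], budget_tokens: int = 120) -> list[str]:
    est = lambda s: len(s) // 4  # crude estimate: ~4 chars per token
    kept, used = [], 0
    for note in reversed(notes):          # walk newest-first
        if used + est(note) > budget_tokens:
            break
        kept.append(note)
        used += est(note)
    kept.reverse()
    dropped = len(notes) - len(kept)
    if dropped:
        kept.insert(0, f"[{dropped} older notes summarized elsewhere]")
    return kept
```

In a real system you'd replace the stub with an actual summary (the `<SUMMARY_UPDATE>` structure below is one format for it), but the invariant is the same: the working layer has a hard size cap.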
Summarize like an engineer: facts, decisions, and rationale
If you've ever watched a long chat go off the rails, it usually fails because old constraints conflict with new ones, or because the model is optimizing for the most recent messages and inventing glue.
AMA tackles this with a Judge/Refresher loop: retrieve candidates, check relevance, detect conflicts, then update or delete outdated memory entries [1]. You can steal the spirit of that without building the whole system.
When I summarize, I force three buckets:
<SUMMARY_UPDATE>
Facts (stable):
- ...
Decisions (latest wins):
- ...
Rationale (1 line each):
- ...
</SUMMARY_UPDATE>
This structure matters. "Facts" are the things you want to persist. "Decisions" are the choices you made under uncertainty. "Rationale" is the anti-amnesia: it prevents you from re-litigating the same argument every 10 turns.
If you want to be extra strict, add: "If any future instruction conflicts with Decisions, point it out and ask which to follow." That single line prevents silent drift.
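The "latest wins" rule plus conflict surfacing is easy to make concrete. Here's a toy sketch, loosely in the spirit of AMA's conflict-checking loop [1]: decisions are keyed by topic, a newer decision replaces an older one, and the replacement is reported instead of silently applied. The `(topic, choice)` tuple format is an assumption for illustration.

```python
# Sketch of "latest wins" with explicit conflict surfacing: later
# decisions on the same topic override earlier ones, and each override
# is recorded so it can be shown to the user instead of hidden.

def merge_decisions(history: list[tuple[str, str]]) -> tuple[dict, list[str]]:
    current, conflicts = {}, []
    for topic, choice in history:
        if topic in current and current[topic] != choice:
            conflicts.append(f"{topic}: '{current[topic]}' -> '{choice}' (latest wins)")
        current[topic] = choice
    return current, conflicts

decisions, conflicts = merge_decisions([
    ("format", "markdown"),
    ("tone", "formal"),
    ("format", "plain text"),
])
```

The conflicts list is the part most systems skip: it turns silent drift into an explicit question you can put back to the user.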
When you need real long-term context: treat memory as a subsystem
At some point (coding agents, research assistants, multi-week projects) you're not "prompting" anymore. You're running a process.
This is where the research direction becomes directly practical. MCE describes context as a function that can include retrieval, filtering, formatting, and composition logic, not just static text [2]. AMA shows the concrete wins from routing (query rewriting + intent detection), multi-granularity storage, and explicit conflict resolution [1]. METIS, a stage-aware research mentor, includes a "session memory" that maintains stage and constraints across turns so the user can progress over weeks [3]. That's basically "context persistence" done intentionally.
So if you're building an app, the best way to "keep context in prompt" is often: don't. Keep context in storage. Retrieve the minimal slice needed for the current step. Summarize state transitions. And periodically "refresh" your canonical context (your pinned constraints + current plan).
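A toy version of that storage-first approach, loosely following AMA's multi-granularity idea [1]: entries carry a kind tag (fact / decision / episode), and each turn retrieves only the kinds and topics the current step needs. All names here are illustrative, not from any real memory library.

```python
# Toy sketch of "keep context in storage, not in the prompt": tagged
# entries live outside the window, and retrieve() returns only the
# minimal, newest slice the current step asks for.

class MemoryStore:
    def __init__(self):
        self.entries = []  # (kind, topic, text)

    def add(self, kind: str, topic: str, text: str):
        self.entries.append((kind, topic, text))

    def retrieve(self, kinds: set[str], topic: str, limit: int = 3) -> list[str]:
        hits = [t for k, top, t in self.entries if k in kinds and top == topic]
        return hits[-limit:]  # newest matching slice only

mem = MemoryStore()
mem.add("fact", "audience", "Readers are backend engineers.")
mem.add("episode", "audience", "Long brainstorm about tone (archived).")
mem.add("decision", "audience", "Assume Python familiarity.")
slice_ = mem.retrieve({"fact", "decision"}, "audience")
```

Notice what the `retrieve` call excludes: the archived episode never re-enters the prompt unless a step explicitly asks for episodes. That's the whole trick.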
Practical examples (and how people actually talk about it)
A Reddit thread I liked framed it as: most people obsess over the "prompt," but the real lever is the surrounding situation-who it's for, what the goal is, what constraints exist [6]. Another post argued for an explicit separation: write your context first (role, domain boundaries, constraints), then a minimal instruction [5]. I agree with the framing, even if I'd tighten the execution: context isn't "make it longer," it's "make it structured and conflict-resistant."
Try this prompt the next time a conversation starts getting messy:
We're going to reset working memory without losing requirements.
1) Create a compact "Pinned Context" (max 12 lines) with role, goal, constraints, and output format.
2) Create a "Working State" (max 10 lines) listing current step + latest decisions.
3) List any conflicts you detect between earlier constraints and recent instructions.
4) Ask up to 3 questions if anything is ambiguous.
Then proceed with the task: {your task}
That's your manual "compactor." It's not magic, but it stops the slow drift that kills long sessions.
Closing thought
If you want consistent outputs, stop treating context as "chat history" and start treating it as state. Pin what must not change. Prune what's temporary. Pull what's needed on demand. And when things get long, summarize into facts and decisions; don't just keep scrolling.
That's how you keep context in a prompt without turning your prompt into a landfill.
References
Documentation & Research
[1] AMA: Adaptive Memory via Multi-Agent Collaboration - arXiv - https://arxiv.org/abs/2601.20352
[2] Meta Context Engineering via Agentic Skill Evolution - arXiv - https://arxiv.org/abs/2601.21557
[3] METIS: Mentoring Engine for Thoughtful Inquiry & Solutions - arXiv - https://arxiv.org/abs/2601.13075
Community Examples
[4] How do you handle and maintain context? - r/PromptEngineering - https://www.reddit.com/r/PromptEngineering/comments/1qpdmqa/how_do_you_handle_and_maintain_context/
[5] An Alternative View - r/ChatGPTPromptGenius - https://www.reddit.com/r/ChatGPTPromptGenius/comments/1qo0rr5/an_alternative_view/
[6] I can do anything… just tell me who, why, and for what...?? - r/ChatGPT - https://www.reddit.com/r/ChatGPT/comments/1qq6t14/i_can_do_anything_just_tell_me_who_why_and_for/
