Prompt Tips · Jan 30, 2026 · 8 min read

Keeping Context in a Prompt: The 3-Layer Pattern That Stops Long Chats From Derailing

A practical way to preserve context in prompts using a stable brief, rolling memory, and focused task blocks, without blowing your token budget.


Most "context problems" aren't model problems. They're prompt architecture problems.

When people say "the model lost context," what they usually mean is: "I let the conversation grow until the important stuff got diluted, contradicted, or buried." And then the model did the most predictable thing in the world: it optimized for whatever was most recent, loudest, and easiest to pattern-match.

What's interesting is that the fix is rarely "add more context." In many systems, more information can actually make performance worse because it increases noise and conflict. That "less-is-more" effect shows up even in research on LLM monitoring: giving the monitor more transcript content can reduce detection performance, and filtering/extracting the relevant bits often helps [2]. Different domain, same core lesson: context isn't "everything available," it's "everything that matters."

So let's talk about how to keep context in a prompt, in a way that survives long chats, shifting constraints, and real product usage.


The context window is not your memory. It's your budget.

At a mechanical level, the model only "sees" what you send in the current request. In chat UIs that feels like memory, but it's really just a rolling buffer. The practical takeaway is brutal: if you don't restate what matters, you're betting your outcomes on accidental placement in the prompt.
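The "rolling buffer" behavior can be made concrete with a minimal sketch. Everything here is illustrative: `MAX_TOKENS` is a hypothetical budget and `estimate_tokens` is a crude heuristic, not a real tokenizer.

```python
# A minimal sketch of why chat "memory" is just a rolling buffer:
# every request re-sends a message list, and anything that doesn't
# fit the budget silently falls out of context.

MAX_TOKENS = 8000  # hypothetical context budget


def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)


def build_request(history: list[dict]) -> list[dict]:
    """Keep the most recent messages that fit the budget; drop the oldest first."""
    kept, used = [], 0
    for msg in reversed(history):
        cost = estimate_tokens(msg["content"])
        if used + cost > MAX_TOKENS:
            break  # everything older than this point is simply not seen
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Notice there is no clever recall here: if an early constraint got trimmed, the model never sees it, no matter how important it was.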

That's why I like to treat context as a deliberately managed artifact, not a side effect of conversation.

And here's the first opinionated rule: context should have structure. Not vibes. Not "remember what I said earlier." Structure.

A good structure does two things:

It makes the model's job easier by labeling what each piece of text is (requirements vs. examples vs. history). And it makes your job easier by giving you a place to update "what we know" without rewriting everything.


The 3-layer pattern: Stable Brief → Rolling Memory → Task Block

When I want a model to stay coherent over time, I separate the prompt into three layers.

Layer 1 is a stable brief. This is the stuff that should be true for the whole session: role, quality bar, audience, hard constraints, definitions. You don't rewrite this every turn unless the project changes.

Layer 2 is rolling memory. This is the smallest possible summary of what happened so far that still lets the model act correctly. Think: decisions, open questions, current state, and a few "do not forget" constraints.

Layer 3 is the task block. This is the single thing you want right now, with any local input included.

The reason this pattern works is that it's basically "information filtering" applied to prompting. Instead of pasting a huge transcript, you hand the model the distilled state it needs. Again: less can beat more [2].

Here's a template I've used in production.

<SYSTEM_BRIEF>
You are my assistant for [domain]. Optimize for: correctness, clarity, and constraint adherence.
Hard constraints:
- [constraint 1]
- [constraint 2]
Definitions:
- [term] = ...
Output format: ...
</SYSTEM_BRIEF>

<ROLLING_MEMORY>
Current goal: ...
Decisions made: ...
Known facts (trusted): ...
Constraints (active): ...
Open questions: ...
</ROLLING_MEMORY>

<TASK>
Do this now: ...
Inputs:
- ...
Acceptance criteria:
- ...
</TASK>

If you do just one thing after reading this article, do this. It's the simplest "context retention system" that doesn't require tools, vector DBs, or fancy agent frameworks.
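If you want to mechanize the template rather than paste it by hand, a small renderer is enough. This is a sketch under my own assumptions: the function name `render_prompt` and the dict-based memory shape are choices, not anything the pattern requires.

```python
# Sketch: assemble the three layers into one prompt string.
# The tag names match the template above; everything else is a convention.

def render_prompt(brief: str, memory: dict[str, str], task: str) -> str:
    """Build a 3-layer prompt: stable brief, rolling memory, task block."""
    memory_lines = "\n".join(f"{key}: {value}" for key, value in memory.items())
    return (
        f"<SYSTEM_BRIEF>\n{brief}\n</SYSTEM_BRIEF>\n\n"
        f"<ROLLING_MEMORY>\n{memory_lines}\n</ROLLING_MEMORY>\n\n"
        f"<TASK>\n{task}\n</TASK>"
    )
```

Each turn, you edit the `memory` dict and the `task` string; the brief stays put. That's the whole system.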


Why this beats "just keep chatting"

Long chats create three predictable failure modes:

First, contradictions accumulate. You say "keep it short," then later you ask for more detail, then later you say "short again," and now you've got conflicting instructions spread across a transcript.

Second, the "state" gets implicit. The model has to infer what decisions were final, what was brainstorming, what was rejected, and what was agreed.

Third, your most important constraints stop being salient. They're technically still in the transcript, but they're no longer foregrounded.

Rolling memory fixes all three by turning the chat into a state machine: we explicitly store what matters, and we rewrite it when it changes.
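The "state machine" framing can be sketched as a small data structure. The field names mirror the template above; the `resolve` helper is my own hypothetical convention for turning an open question into a recorded decision.

```python
# Sketch: rolling memory as explicit state, not an implicit transcript.
from dataclasses import dataclass, field


@dataclass
class RollingMemory:
    goal: str = ""
    decisions: list[str] = field(default_factory=list)
    facts: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def resolve(self, question: str, decision: str) -> None:
        """Close an open question by recording the decision it produced."""
        if question in self.open_questions:
            self.open_questions.remove(question)
        self.decisions.append(decision)
```

The point isn't the class; it's that "what was decided" and "what's still open" live in different fields, so the model never has to guess which is which.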

There's also a nice parallel to research assistants that explicitly maintain session context across long workflows. METIS, for example, is built around stage-aware routing plus session memory so learners can "make steady progress over weeks" [1]. You don't need METIS to benefit from the idea. You just need to stop pretending the transcript is a database.


Make context "addressable" with labels and delimiters

A hidden superpower of the 3-layer pattern is that the model can cite and reason about parts of context. Humans do this naturally. Models need help.

So I'll often add a line like:

"Treat <SYSTEM_BRIEF> as non-negotiable unless I explicitly override it."

And I'll explicitly tell the model how to update memory:

"Before answering, propose updates to <ROLLING_MEMORY> if needed."

This gives you an interaction rhythm that stays stable even when tasks change.


Keep it small on purpose: summarize, don't accumulate

If your rolling memory grows forever, you've just rebuilt the transcript problem.

My rule of thumb is: rolling memory should fit on one screen. If it doesn't, it's not memory; it's a second transcript.

This aligns with the "extract and evaluate" style of filtering described in monitoring research, where an extractor isolates relevant excerpts and an evaluator uses only those excerpts to judge [2]. In prompting terms, you are the extractor. Your model is the evaluator.
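One way to enforce the one-screen rule is a hard cap with an explicit overflow marker, so old entries are visibly collapsed rather than silently lost. `MAX_LINES` and `keep` are arbitrary knobs I'm assuming here, not part of any standard.

```python
# Sketch: keep rolling memory bounded on purpose.

MAX_LINES = 25  # roughly "one screen"; tune to taste


def needs_compression(memory_text: str) -> bool:
    """Flag memory that has outgrown a single screen."""
    return len(memory_text.splitlines()) > MAX_LINES


def compress(items: list[str], keep: int = 5) -> list[str]:
    # Replace older entries with a one-line count so the overflow is
    # visible and you (the extractor) know to re-summarize by hand.
    if len(items) <= keep:
        return items
    summary = f"(earlier: {len(items) - keep} items summarized elsewhere)"
    return [summary] + items[-keep:]
```

The placeholder line is deliberate: a memory that shrinks silently is just the context-window problem again, one level up.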


Practical examples (what I'd actually paste into ChatGPT/Claude)

Someone on Reddit put it bluntly: the "prompt" is the request, but the "context" is the guardrails, and mixing them is why prompts bloat and still underperform [3]. I mostly agree, even if I'd phrase it less poetically. Here's how I'd apply it to two common situations.

Example 1: Keep context across an evolving product spec

<SYSTEM_BRIEF>
You are a senior product + backend engineer pair. Be concrete. If requirements conflict, ask a question.
We are building: a SaaS onboarding checklist feature.
Tech: Node.js, Postgres. Multi-tenant. EU privacy constraints.
Output: short sections, no fluff.
</SYSTEM_BRIEF>

<ROLLING_MEMORY>
Current goal: design v1 data model + API endpoints.
Decisions made:
- Checklist templates are defined per workspace.
- Each user has checklist progress per template.
Constraints:
- No PII in event logs.
Open questions:
- Do we need per-item due dates or just ordering?
</ROLLING_MEMORY>

<TASK>
Propose a Postgres schema (tables + key columns) and 5 REST endpoints.
Include how you would handle multi-tenancy and migrations.
</TASK>

Next turn, instead of "remember what we decided," you update <ROLLING_MEMORY> with the new decision and move on.

Example 2: Long brainstorming that starts hallucinating constraints

This is straight out of what people complain about in long conversations: things get redundant, self-conflicting, and the model starts inventing [4]. The fix is to reset the state.

<SYSTEM_BRIEF>
You are a strategy analyst. Only use facts inside ROLLING_MEMORY or explicitly marked assumptions.
If you make an assumption, label it ASSUMPTION.
</SYSTEM_BRIEF>

<ROLLING_MEMORY>
Goal: pick 1 positioning angle for a landing page.
Facts:
- Audience: CTOs at 50-200 person SaaS companies
- Product: hosted feature flagging with audit logs
- Differentiator: extremely fast setup + good RBAC
Constraints:
- No claims about "SOC 2" unless explicitly stated
</ROLLING_MEMORY>

<TASK>
Give me 3 positioning angles with a 1-sentence headline + 2-sentence rationale.
Then ask 3 questions that would let you pick the best one.
</TASK>

You'll notice what's missing: the transcript. That's the point.


Closing thought: context is a product feature, not a prompt trick

If you're building anything beyond one-off chats, "keeping context" is your product's job. Your prompt is just the interface.

The most reliable approach I've seen is treating context like data: define it, version it, compress it, and refresh it. The 3-layer pattern gives you a simple starting architecture that scales from solo use to teams to agents.

Try it for a week. The first time you avoid a 200-message thread by pasting a 20-line rolling memory, you won't go back.


References

  1. METIS: Mentoring Engine for Thoughtful Inquiry & Solutions - arXiv cs.LG
    https://arxiv.org/abs/2601.13075

  2. How does information access affect LLM monitors' ability to detect sabotage? - arXiv cs.AI
    https://arxiv.org/abs/2601.21112

Community Examples

  3. An Alternative View - r/ChatGPTPromptGenius
    https://www.reddit.com/r/ChatGPTPromptGenius/comments/1qo0rr5/an_alternative_view/

  4. How do you handle and maintain context? - r/PromptEngineering
    https://www.reddit.com/r/PromptEngineering/comments/1qpdmqa/how_do_you_handle_and_maintain_context/

Ilia Ilinskii

Founder of Rephrase-it. Building tools to help humans communicate with AI.
