Learn how episodic, semantic, and procedural memory fit together in LLM agents, and how to design a memory architecture that scales.
Most agent memory setups look smart in diagrams and dumb in production. The reason is simple: they store too much raw history and call it memory.
Agent memory architecture is the system that decides what an LLM agent stores, how it organizes it, and what it retrieves later to make better decisions. In practice, good architectures separate raw experience from distilled knowledge so the agent can reason with compact, relevant memory instead of replaying entire histories. [1][2]
Here's the key distinction I keep coming back to: not all memory should be treated equally. In recent agent research, episodic memory is the raw trace of interactions, semantic memory is the factual layer abstracted from those traces, and procedural memory is the reusable action layer that captures how to solve tasks. PlugMem makes this separation explicit and uses episodic memory as the grounding layer from which semantic and procedural knowledge are extracted. [1]
That matches the broader survey view too. The most useful way to think about agent memory is as a continuous write-manage-read loop. Agents don't just save things. They write, consolidate, retrieve, update, and sometimes forget. If you skip the management part, memory turns into clutter fast. [2]
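The write-manage-read loop is easy to sketch as an interface. This is a minimal illustration of the lifecycle the survey describes, not any cited system's API; the class and method names are my own.

```python
# Minimal sketch of the write-manage-read loop: agents write,
# consolidate, retrieve, and forget. Names are illustrative.
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    entries: list = field(default_factory=list)

    def write(self, item: str) -> None:
        """Persist a new memory entry."""
        self.entries.append(item)

    def consolidate(self) -> None:
        """Merge exact duplicates so clutter doesn't accumulate."""
        self.entries = list(dict.fromkeys(self.entries))

    def read(self, query: str) -> list:
        """Naive keyword retrieval; real systems use embeddings."""
        return [e for e in self.entries if query.lower() in e.lower()]

    def forget(self, predicate) -> None:
        """Drop entries that no longer earn their keep."""
        self.entries = [e for e in self.entries if not predicate(e)]


store = MemoryStore()
store.write("User prefers concise answers")
store.write("User prefers concise answers")  # duplicate write
store.consolidate()                          # management step, not optional
print(store.read("concise"))
```

The point of the sketch is the `consolidate` and `forget` steps: skip them and `entries` is just an append-only log, which is exactly the clutter problem.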
LLM agents need different memory types because each one solves a different failure mode: episodic memory preserves concrete past events, semantic memory supports stable knowledge reuse, and procedural memory helps the agent repeat successful strategies. Using only one layer usually creates either context bloat or shallow recall. [1][2]
Episodic memory is the "what happened" layer. In PlugMem, it's formalized as structured observation-action traces rather than loose text blobs. That matters because raw episodes are useful for verification and reconstruction, but they're noisy as direct reasoning input. [1]
Semantic memory is the "what tends to be true" layer. This is where you store facts like user preferences, known constraints, or abstracted domain knowledge. The benefit is obvious: the agent no longer has to reread ten prior conversations to remember that a user prefers concise answers or that a given API has a fixed rate limit. [1][2]
Procedural memory is the "how to do it" layer. This is the underrated one. It stores reusable action patterns: how to filter products on a shopping site, how to debug a flaky script, how to work through a multi-step workflow. PlugMem represents this as intent-prescription pairs, which I think is the right framing: the goal and the method belong together. [1]
The table below is the simplest way to see the difference.
| Memory type | Stores | Best used for | Main risk |
|---|---|---|---|
| Episodic | Specific interactions, actions, observations | Grounding, auditability, reconstruction | Too verbose for direct use |
| Semantic | Facts, concepts, stable preferences | Fast retrieval of reusable knowledge | Can drift or oversimplify |
| Procedural | Strategies, workflows, action patterns | Reusing successful task methods | Can become stale if environment changes |
A strong LLM agent should first capture raw episodes, then distill them into semantic facts and procedural strategies, while keeping provenance back to the original episode. This gives the agent both abstraction and traceability, which is exactly what most flat memory systems lack. [1]
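Here is what capture-then-distill with provenance can look like in code. This loosely follows the PlugMem framing of propositions and intent-prescription pairs, but the field names and schema are my own illustrative assumptions, not the paper's actual data model.

```python
# Sketch of episodic capture -> semantic/procedural distillation,
# with provenance links back to source episodes. Schema is illustrative.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Episode:
    id: str
    observation: str
    action: str


@dataclass
class SemanticFact:
    proposition: str
    source_episode_ids: List[str] = field(default_factory=list)  # provenance


@dataclass
class Procedure:
    intent: str               # the goal...
    prescription: str         # ...and the method, kept together
    source_episode_ids: List[str] = field(default_factory=list)  # provenance


ep = Episode(id="ep-42",
             observation="User rejected a long reply twice",
             action="Shortened the response; user accepted")

fact = SemanticFact(proposition="User prefers concise answers",
                    source_episode_ids=[ep.id])

skill = Procedure(intent="answer user questions",
                  prescription="Default to short replies; expand only on request",
                  source_episode_ids=[ep.id])

# Any retrieved fact or workflow can be traced back to its evidence:
assert ep.id in fact.source_episode_ids and ep.id in skill.source_episode_ids
```

The `source_episode_ids` fields are the part flat memory systems drop: when a distilled fact looks wrong, you can pull the original episode and check.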
This is probably the most important design choice in the whole architecture.
PlugMem argues that episodic memory is the substrate from which semantic and procedural memory are derived. It extracts propositions for semantic memory and prescriptions for procedural memory, while linking both back to source episodes through provenance edges. [1] That last part is crucial. If a retrieved "fact" or "workflow" can't be traced back to what actually happened, debugging gets ugly fast.
What I noticed across the broader literature is that many memory systems stop at retrieval. They index chunks, run similarity search, and hope the right passage comes back. But more recent work shows that performance depends heavily on whether the agent can organize memory into the right structure in the first place. StructMemEval found that memory-augmented agents do much better when they are explicitly prompted or designed to structure their memory, while naive retrieval systems struggle on tasks like ledgers, trees, and state tracking. [3]
So the flow I recommend looks like this:
1. Capture raw episodes as structured observation-action traces.
2. Distill stable facts into semantic memory and reusable strategies into procedural memory.
3. Link every distilled item back to its source episodes for provenance.
4. At retrieval time, route by memory type first, and fall back to raw episodes only to verify ambiguous details.
If you want more articles on building better AI workflows, the Rephrase blog covers practical prompt and agent design patterns like this in a pretty no-nonsense way.
Good agent retrieval selects the right memory type for the current task, then returns compact, decision-relevant information instead of dumping long transcripts into the prompt. The best systems use structure to narrow the search space and use reasoning to compress the final memory payload. [1][2]
This is where many agent demos fall apart.
PlugMem's retrieval module first decides whether the agent should emphasize episodic, semantic, or procedural memory. It then retrieves over semantic and procedural graphs, using high-level concepts or intents as routing signals before surfacing low-level propositions or prescriptions. [1] In plain English: retrieve with abstraction first, specificity second.
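A toy version of that two-stage pattern makes it concrete. This is not PlugMem's actual retrieval module; the router heuristic, concept keys, and data below are invented for illustration.

```python
# Sketch of "abstraction first, specificity second" retrieval:
# 1) route by memory type, 2) narrow by high-level concept or intent,
# 3) only then surface low-level items. All data is illustrative.
MEMORY = {
    "semantic": {
        "user-preferences": ["User prefers concise answers"],
        "api-limits": ["Orders API allows 100 requests/min"],
    },
    "procedural": {
        "filter-products": ["Open facets sidebar, set price range, apply"],
        "debug-script": ["Re-run with verbose logging, isolate failing step"],
    },
}


def route(task: str) -> str:
    """Crude router: 'how...' tasks lean procedural, else semantic."""
    return "procedural" if task.lower().startswith("how") else "semantic"


def retrieve(task: str, concept: str) -> list:
    layer = MEMORY[route(task)]          # 1. pick the memory type
    candidates = layer.get(concept, [])  # 2. narrow by concept/intent
    return candidates                    # 3. surface low-level items


print(retrieve("How do I filter products?", "filter-products"))
```

Notice that similarity search never runs over the whole store: the routing and concept steps shrink the candidate set before any low-level matching happens.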
That pattern matters because raw similarity search often gives you the wrong kind of "relevant." Something can be semantically similar without being useful. The broader survey makes the same point from a different angle: memory is not just about bigger context or better recall. It's about maintaining a sufficient internal state for good action selection under limited compute and context budgets. [2]
Here's a quick before-and-after prompt pattern that shows the difference.

**Before**

```
Use the chat history and help me continue the task.
```

**After**

```
You are continuing an ongoing task.

First, retrieve:
1. The most relevant semantic facts and constraints
2. The most relevant procedural strategy for this task type
3. Only the episodic traces needed to verify ambiguous details

Then produce:
- the next best action
- the reason for it
- any uncertainty caused by missing or conflicting memory
```
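One way to mechanize a prompt like that is to assemble it from already-retrieved memory payloads instead of pasting a transcript. The function and variable names here are my own illustration of the pattern, not any framework's API.

```python
# Assemble a memory-aware prompt from typed payloads: facts,
# one strategy, and only the episodes needed for verification.
def build_prompt(facts: list, strategy: str, episodes: list) -> str:
    parts = ["You are continuing an ongoing task.", "", "Relevant memory:"]
    parts += [f"- Fact: {f}" for f in facts]
    parts.append(f"- Strategy: {strategy}")
    parts += [f"- Episode (for verification): {e}" for e in episodes]
    parts += [
        "",
        "Produce: the next best action, the reason for it, and any",
        "uncertainty caused by missing or conflicting memory.",
    ]
    return "\n".join(parts)


prompt = build_prompt(
    facts=["User prefers concise answers"],
    strategy="Default to short replies; expand only on request",
    episodes=["ep-42: long reply rejected twice"],
)
print(prompt)
```

The type separation lives in the function signature, so the prompt can't degenerate back into one undifferentiated history blob.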
That second prompt is doing hidden architecture work. It nudges the system to separate memory by function instead of treating everything as one blob. Tools like Rephrase are helpful here because they can rewrite rough task instructions into more structured prompts like this without breaking your flow.
Bad agent memory architectures usually fail by storing everything, retrieving the wrong abstraction level, or never revising stale memory. The result is familiar: hallucinated continuity, repeated mistakes, and massive prompt pollution that makes the agent feel forgetful even when it remembers too much. [2][3]
I'd narrow the common mistakes to three.
First, people confuse storage with memory quality. A giant vector database is not a memory architecture. It's just storage.
Second, they over-trust retrieval. StructMemEval is useful here because it shows that some tasks require actual organization, not just recall. Retrieval baselines can look fine on simple fact lookup and still fail badly on structured tasks. [3]
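A toy ledger shows the failure mode. The event data is invented, but the contrast is the one StructMemEval points at: a maintained state answers "current balance?" in one lookup, while a raw transcript forces replay at question time.

```python
# Structured state vs. naive transcript memory on a ledger task.
events = [
    ("deposit", 100),
    ("withdraw", 30),
    ("deposit", 50),
]

# Naive "memory": the raw transcript. Answering "current balance?"
# means re-reading and replaying everything at question time.
transcript = [f"{kind} {amount}" for kind, amount in events]

# Structured memory: fold each event into state as it arrives.
balance = 0
for kind, amount in events:
    balance += amount if kind == "deposit" else -amount

print(balance)  # one lookup, no replay
```

Similarity search over `transcript` can return a "relevant" deposit line and still get the balance wrong; only the folded state answers the actual question.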
Third, they ignore lifecycle management. The broader survey is blunt about this: memory needs filtering, contradiction handling, consolidation, and forgetting. Otherwise old junk keeps leaking into current decisions. [2]
A practical community tutorial I reviewed made this same point in a more implementation-heavy way. It used salience, novelty thresholds, usage decay, and episodic lessons to avoid storing every raw interaction and repeating the same memory forever. That's not a primary source, but it's a good example of how practitioners are turning the research into workable heuristics. [4]
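Those heuristics are easy to prototype as a write gate. The thresholds and the similarity measure below are illustrative assumptions, not values from the tutorial or the research.

```python
# Sketch of write gating: only store items that are salient enough
# and novel relative to what's already stored. Thresholds are
# illustrative; real systems would tune them and use embeddings.
import difflib


def is_novel(candidate: str, stored: list, threshold: float = 0.9) -> bool:
    """Reject near-duplicates of existing memories."""
    return all(
        difflib.SequenceMatcher(None, candidate, s).ratio() < threshold
        for s in stored
    )


def maybe_write(candidate: str, salience: float, stored: list,
                min_salience: float = 0.5) -> bool:
    """Gate the write; return True only if the item was stored."""
    if salience >= min_salience and is_novel(candidate, stored):
        stored.append(candidate)
        return True
    return False


memories: list = []
maybe_write("User prefers concise answers", salience=0.8, stored=memories)
maybe_write("User prefers concise answers!", salience=0.8, stored=memories)  # near-dup, rejected
maybe_write("Weather was nice today", salience=0.1, stored=memories)         # low salience, rejected
print(memories)
```

Usage decay would be a third gate on the read path (drop entries whose retrieval count stays at zero), which the `forget` side of the lifecycle can implement.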
The best memory prompts tell the model what type of memory to write or retrieve, what to ignore, and how to compress the result. If you don't specify that, most LLMs default to vague summarization or brute-force recall, which is usually the wrong behavior. [1][3]
If I'm designing prompts for a memory-aware agent, I usually make the memory contract explicit. I'll ask the system to extract one stable fact, one reusable strategy, and one episode worth preserving. That prevents the model from turning every interaction into a mini essay.
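That memory contract can be enforced in code, not just in the prompt. The prompt wording, schema keys, and validator below are my own sketch of the idea; the model reply is a stand-in string.

```python
# Sketch of an explicit memory contract: ask for exactly one fact,
# one strategy, and one episode, then validate before writing.
import json

CONTRACT_PROMPT = """After this interaction, output JSON with exactly:
- "fact": one stable fact worth remembering
- "strategy": one reusable strategy
- "episode": one episode worth preserving verbatim
No summaries, no extra keys."""


def validate_memory(raw: str) -> dict:
    """Parse a model reply and reject anything that breaks the contract."""
    data = json.loads(raw)
    expected = {"fact", "strategy", "episode"}
    if set(data) != expected:
        raise ValueError(f"contract violated: got keys {sorted(data)}")
    return data


# Stand-in for a model response:
reply = ('{"fact": "API rate limit is 100/min", '
         '"strategy": "batch requests", '
         '"episode": "ep-7: 429 error after burst"}')
print(validate_memory(reply)["fact"])
```

Rejecting malformed output at write time is what keeps "every interaction becomes a mini essay" out of the store.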
The broader lesson here is useful beyond agent builders. If you're working across browser tabs, IDEs, docs, and chat apps, structured prompting matters just as much as model choice. That's why I like to keep prompts architecture-aware, and why apps like Rephrase feel natural in this workflow: they help turn vague instructions into prompts with clearer retrieval, compression, and output constraints.
Agent memory gets a lot better when you stop asking, "How do I store more?" and start asking, "What kind of memory is this?"
That's the shift. Episodes are evidence. Semantics are facts. Procedures are skills. Once you separate those layers, your agent stops feeling like a chatbot with a scrapbook and starts acting more like a system that actually learns.
Documentation & Research
Community Examples

4. How to Build Memory-Driven AI Agents with Short-Term, Long-Term, and Episodic Memory - MarkTechPost (link)
The core types are episodic memory for specific past interactions, semantic memory for distilled facts and concepts, and procedural memory for reusable strategies or workflows. Strong agent systems usually combine all three instead of relying on raw chat history alone.
A common pattern is to store raw episodes first, then extract stable facts into semantic memory and reusable action patterns into procedural memory. This reduces context bloat and makes future retrieval more targeted.