Most agent teams track latency, token usage, and task success. Fair enough. But if you only watch those numbers, you can miss the metric that explains all three.
Key Takeaways
- KV-cache hit rate is often the clearest leading indicator of agent latency and cost.
- Stable, append-only prompt prefixes are what make cache reuse possible in production.
- Research shows cache reuse can cut time-to-first-token by orders of magnitude in some agent workflows.[1][2]
- A low hit rate usually points to orchestration mistakes, not model quality.
- Before you rewrite prompts manually, fix the structure around them.
What is KV-cache hit rate, really?
KV-cache hit rate is the percentage of a request's prompt computation that the model can reuse from prior work instead of recomputing token by token. In practice, it is a reuse metric for attention state. When it is high, agents get faster and cheaper. When it is low, every turn behaves like a cold start.[1][2]
Here's the simple version. Transformer inference has two broad phases: prefill and decode. Prefill processes the input context and builds the key-value cache. Decode generates the next tokens using that cache. If your next request starts with the same prefix as the previous one, the system can often reuse the cached prefix instead of rebuilding it.[2]
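The prefix-reuse idea is easy to see in a toy sketch. This is not how a real inference engine implements it (real systems match cached token blocks, not Python lists), but the arithmetic of "hit rate" is the same; function names are mine:

```python
def longest_cached_prefix(prompt_tokens, cached_tokens):
    """Count how many leading tokens match a previously cached prompt."""
    n = 0
    for a, b in zip(prompt_tokens, cached_tokens):
        if a != b:
            break
        n += 1
    return n

def kv_hit_rate(prompt_tokens, cached_tokens):
    """Fraction of prefill work reusable from the cache."""
    if not prompt_tokens:
        return 0.0
    return longest_cached_prefix(prompt_tokens, cached_tokens) / len(prompt_tokens)

prev = ["sys", "rules", "tools", "turn1"]
curr = ["sys", "rules", "tools", "turn1", "turn2"]
print(kv_hit_rate(curr, prev))  # 0.8 -> 80% of the prefill is reusable
```

Notice that the hit rate collapses to near zero the moment any early token differs, which is why the rest of this article obsesses over prefix stability.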
That is why I think KV-cache hit rate is the production metric. It sits upstream of latency, throughput, and often even quality. If your agents keep cold-starting long contexts, they get slower, more expensive, and more likely to drown in irrelevant history.[2][3]
Why does KV-cache hit rate beat most other metrics?
KV-cache hit rate beats most operational metrics because it explains both cost and latency at the point where inference work is created. Token counts tell you volume. Hit rate tells you reuse. In agents with long, repeated prefixes, reuse is where the real leverage lives.[1][2]
The strongest evidence comes from systems papers, not Twitter takes. TVCACHE reports that when repeated tool trajectories are cached correctly, median tool-call execution time drops by up to 6.9x, with cache hit rates up to 70%.[1] A separate paper on persistent KV cache for multi-agent inference reports time-to-first-token reductions from tens or even hundreds of seconds down to sub-second or low-second reload times, depending on context size and cache state.[2]
Here's what I notice in production teams: they celebrate shaving 10% off prompt length while quietly invalidating the prefix on every turn. That is backwards. A slightly longer prompt with a stable reusable prefix is often better than a shorter prompt that busts the cache every time.
| Metric | What it tells you | What it misses |
|---|---|---|
| Token count | How much context you send | Whether any of it gets reused |
| Latency | End-user wait time | Why the request was slow |
| Cost per run | Spend per workflow | Whether structure can reduce recompute |
| Success rate | Task outcome | Infrastructure waste |
| KV-cache hit rate | Reuse of expensive prefill work | Needs context from other metrics |
What actually drives KV-cache hit rate up or down?
KV-cache hit rate rises when prefixes stay stable and falls when early prompt tokens change unnecessarily. The biggest killers are dynamic metadata, reordered tool definitions, non-deterministic serialization, and rewriting prior context instead of appending new context.[1][4]
The community source here is practical and useful: Bustamante's write-up calls out stable prefixes and append-only context as the biggest production optimization for agent systems.[4] That matches the papers. TVCACHE relies on longest-prefix matching over tool histories because exact prior state is what makes reuse safe.[1] The persistent agent-memory paper also depends on monotonic prompt extension and prefix matching to reuse cache across phases.[2]
A few patterns matter more than people think:
Stable prefixes
Put long-lived instructions, policies, examples, and tool schemas first. Put volatile data later. If the first 2,000 tokens stay identical, that is gold.
Append-only context
Do not rewrite previous turns if you can avoid it. Once you mutate earlier tokens, you invalidate everything after that point.[4]
Deterministic formatting
If your orchestration layer serializes tools or state differently on each run, you are manufacturing cache misses.
Tool discipline
Changing available tools mid-run may help reasoning, but it can wreck reuse. Static definitions with runtime constraints are often the better trade-off.[1][4]
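Deterministic formatting is the easiest of these to enforce in code. Here is a minimal sketch using Python's standard `json` module with sorted keys and a canonical tool order; the tool list itself is illustrative:

```python
import json

TOOLS = [  # example tool schemas; in practice these come from your registry
    {"name": "bash", "description": "Run a shell command"},
    {"name": "grep", "description": "Search files"},
    {"name": "read_file", "description": "Read a file"},
]

def serialize_tools(tools):
    """Canonical JSON: stable tool order, sorted keys, no whitespace drift."""
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))

# The same input always yields byte-identical output, so a prompt
# prefix containing it never changes between turns.
assert serialize_tools(TOOLS) == serialize_tools(list(reversed(TOOLS)))
```

Anything that touches the prompt prefix (tool specs, policies, memory blocks) deserves this treatment: one canonical serialization, asserted in tests.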
How can you improve KV-cache hit rate in production agents?
You improve KV-cache hit rate by treating prompt structure like infrastructure, not prose. The right move is to design for prefix stability, deterministic state, and selective context growth. This is an engineering problem first and a prompt-writing problem second.[1][2][4]
Here is the workflow I'd use:
- Audit your first 500 to 2,000 prompt tokens across consecutive turns.
- Identify every changing field near the top: timestamps, IDs, tool order, session summaries, transient flags.
- Move dynamic fields as far down as possible.
- Make context append-only unless a rewrite is absolutely required.
- Serialize JSON, tool specs, and memory blocks deterministically.
- Track hit rate per route, agent, and tool-enabled mode.
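The audit step above can be sketched as a divergence check between consecutive turns' prompts; the helper below is mine, not from any of the cited systems:

```python
def first_divergence(prev_prompt: str, curr_prompt: str) -> int:
    """Return the character index where two consecutive prompts diverge."""
    for i, (a, b) in enumerate(zip(prev_prompt, curr_prompt)):
        if a != b:
            return i
    return min(len(prev_prompt), len(curr_prompt))

turn1 = "System: coding agent\nTimestamp: 14:03:11\nTask: fix bug"
turn2 = "System: coding agent\nTimestamp: 14:03:27\nTask: fix bug"
i = first_divergence(turn1, turn2)
print(f"prompts diverge at char {i}: {turn2[i:i+20]!r}")
# A divergence this close to the top (here, inside the timestamp) means
# the cacheable prefix is only a few dozen characters long.
```

Run this across logged prompt pairs per route and sort by divergence index; the routes that diverge earliest are where your cache misses are being manufactured.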
A quick before-and-after makes this concrete:
Before

```
System: You are a helpful coding agent.
Timestamp: 2026-04-17T14:03:11.482Z
Available tools: grep, bash, read_file
Current task summary: ...
Conversation summary: ...
```

After

```
System: You are a helpful coding agent.
Available tools: bash, grep, read_file
Core operating rules: ...
Examples: ...
Conversation summary: ...
Current task summary: ...
Timestamp: 2026-04-17
```
That tiny change can be the difference between a warm prefix and a full recompute. If you want to clean up this kind of prompt structure quickly across apps, tools like Rephrase can help standardize and sharpen prompt text before it reaches your model layer. It will not fix orchestration bugs, but it does reduce the mess humans introduce.
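One way to encode the "after" layout in code is an append-only prompt builder that freezes the stable sections at construction and pins volatile fields to the very end. A sketch, with all section names hypothetical:

```python
class PromptBuilder:
    """Append-only prompt assembly: stable prefix first, volatile data last."""

    def __init__(self, system: str, tools: str, rules: str):
        # Frozen at construction: this prefix must be byte-identical every turn.
        self._prefix = "\n".join([system, tools, rules])
        self._turns: list[str] = []

    def append_turn(self, turn: str) -> None:
        # Only ever append; rewriting earlier turns would invalidate
        # every cached token after the edit point.
        self._turns.append(turn)

    def render(self, volatile: str = "") -> str:
        # Volatile data (timestamps, flags) goes at the very end.
        return "\n".join([self._prefix, *self._turns, volatile])

b = PromptBuilder("System: coding agent", "Tools: bash, grep", "Rules: ...")
b.append_turn("User: fix the bug")
p1 = b.render(volatile="Timestamp: 14:03")
b.append_turn("Assistant: done")
p2 = b.render(volatile="Timestamp: 14:05")
# p2 starts with everything in p1 up to the old timestamp suffix,
# so the shared prefix stays warm across turns.
```

The structural guarantee matters more than the exact API: there is no method for editing a past turn, so the cache-busting mutation simply cannot be expressed.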
What does a good KV-cache strategy look like in practice?
A good KV-cache strategy combines high prefix reuse with safe state management. The best systems do not just cache blindly. They reuse only when prior state is compatible, and they design prompts so compatible state happens often.[1][2][3]
This is where the nuance matters. Not every cache hit is safe. The paper on agent caching failures is a useful warning: caching works only when keys are consistent and hits are precise. Bad cache design can produce unsafe reuse, especially in agentic workflows with stateful tools.[3]
So my rule is simple: maximize reuse, but never fake equivalence.
| Strategy | Benefit | Risk |
|---|---|---|
| Stable shared prefix | High hit rate, lower TTFT | Requires strict formatting discipline |
| Append-only memory | Better reuse across turns | Can grow context too fast |
| Persistent KV cache | Huge TTFT reduction after eviction/reload | More system complexity |
| Semantic or intent caching | Can skip full LLM work entirely | Unsafe if keys are inconsistent |
The interesting part is that these techniques stack. A stable prefix improves model-side KV reuse. Intent or response caching can skip whole requests. Persistent KV storage helps when memory pressure forces eviction. Different layers. Same principle: don't recompute what you already know.
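The stacking can be sketched as a layered lookup: check a response/intent cache first, fall back to prefix-level KV reuse, and only then pay for a cold prefill. Everything here is hypothetical scaffolding (including `run_model`, a stand-in for real inference), not any particular engine's API:

```python
def run_model(cached_prefix: str, new_tokens: str) -> str:
    # Stand-in for real inference; a real engine prefills only `new_tokens`.
    return f"answer(reused={len(cached_prefix)}, computed={len(new_tokens)})"

def serve(prompt: str, response_cache: dict, kv_prefixes: list) -> str:
    # Layer 1: exact/intent cache -- skip the model entirely on a hit.
    if prompt in response_cache:
        return response_cache[prompt]

    # Layer 2: KV reuse -- find the longest cached prefix and only
    # prefill the suffix. (Real engines match token blocks, not strings.)
    best = max((p for p in kv_prefixes if prompt.startswith(p)),
               key=len, default="")
    answer = run_model(cached_prefix=best, new_tokens=prompt[len(best):])

    # Layer 3: record state so future requests can reuse this work.
    kv_prefixes.append(prompt)
    response_cache[prompt] = answer
    return answer
```

The layer ordering is the design point: each layer is cheaper than the one below it, and a miss at one layer still leaves partial reuse available at the next.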
For more articles on prompt and agent infrastructure, the Rephrase blog is worth browsing if you care about turning messy prompts into repeatable systems.
How should teams use this metric day to day?
Teams should use KV-cache hit rate as a diagnostic metric tied to routes, agents, and prompt templates, not as a vanity number. It becomes useful when you compare it with TTFT, token growth, and workflow shape. The point is to find broken structure fast.[1][2]
I would put it on the same dashboard as:
- time-to-first-token
- median prompt length
- tool-call count
- cache hit rate by agent step
- cache miss reasons if your stack can expose them
If a release hurts hit rate, assume something structural changed. Maybe a timestamp moved to the top. Maybe a JSON key order changed. Maybe a summarizer rewrote the running prefix. Those are fixable. That is the good news.
And yes, this is exactly the kind of thing that gets ignored because it does not sound product-facing. But it is. Fast agents feel smarter. Cheap agents scale further. Reused context is one of the cleanest ways to get both.
If you are constantly rewriting prompts by hand, Rephrase is a nice companion for the human side of the workflow. Just remember: better wording helps, but better cacheability compounds.
References
Documentation & Research
- TVCACHE: A Stateful Tool-Value Cache for Post-Training LLM Agents - arXiv cs.LG (link)
- Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices - arXiv cs.LG (link)
- Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning - arXiv cs.CL (link)
- Learning to Evict from Key-Value Cache - arXiv cs.CL (link)
Community Examples
- The LLM Context Tax: Best Tips for Tax Avoidance - Hacker News (LLM) / Nicolas Bustamante (link)