Learn why KV-cache hit rate drives latency and cost for AI agents, and how stable prefixes turn cache reuse into a real production edge.
Most agent teams track latency, token usage, and task success. Fair enough. But if you only watch those numbers, you can miss the metric that explains all three.
KV-cache hit rate is the percentage of a request's prompt computation that the model can reuse from prior work instead of recomputing token by token. In practice, it is a reuse metric for attention state. When it is high, agents get faster and cheaper. When it is low, every turn behaves like a cold start.[1][2]
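As a rough illustration of that definition, the sketch below computes a token-weighted hit rate from per-request counts. The `PromptUsage` type and its `cached_tokens` field are hypothetical names; how cached-token counts are exposed depends on your provider or serving stack.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PromptUsage:
    """Per-request token counts (hypothetical shape, not a real API)."""
    prompt_tokens: int  # total tokens in the request prompt
    cached_tokens: int  # prompt tokens whose attention state was reused


def kv_cache_hit_rate(usages: Iterable[PromptUsage]) -> float:
    """Fraction of prompt tokens served from cache across all requests."""
    usages = list(usages)
    total = sum(u.prompt_tokens for u in usages)
    cached = sum(u.cached_tokens for u in usages)
    return cached / total if total else 0.0


# Three turns of one agent run: a cold first turn, then two warm turns.
turns = [
    PromptUsage(prompt_tokens=4000, cached_tokens=0),
    PromptUsage(prompt_tokens=4200, cached_tokens=4000),
    PromptUsage(prompt_tokens=4500, cached_tokens=4200),
]
rate = kv_cache_hit_rate(turns)
print(f"hit rate: {rate:.1%}")
```

Weighting by tokens rather than averaging per-request ratios keeps long prompts from being under-counted, which matters because long prompts are where prefill cost concentrates.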
Here's the simple version. Transformer inference has two broad phases: prefill and decode. Prefill processes the input context and builds the key-value cache. Decode generates the next tokens using that cache. If your next request starts with the same prefix as the previous one, the system can often reuse the cached prefix instead of rebuilding it.[2]
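To make the prefix idea concrete, here is a minimal sketch that measures how many leading tokens two requests share. Whitespace splitting stands in for a real tokenizer, and the helper name is an assumption; real inference servers may match prefixes at a coarser granularity than single tokens.

```python
from typing import Sequence


def shared_prefix_len(prev: Sequence[str], curr: Sequence[str]) -> int:
    """Count leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


# Whitespace tokens are a stand-in for real tokenizer output.
prev = "SYSTEM rules tools user: hi".split()

# Appending to the previous prompt keeps the whole prefix reusable.
appended = prev + ["assistant:", "hello"]
reusable = shared_prefix_len(prev, appended)
to_prefill = len(appended) - reusable

# Inserting a token at the front leaves nothing to reuse.
mutated = ["TIMESTAMP"] + prev
reusable_after_mutation = shared_prefix_len(prev, mutated)

print(reusable, to_prefill, reusable_after_mutation)
```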
That is why I think KV-cache hit rate is the production metric. It sits upstream of latency, throughput, and often even quality. If your agents keep cold-starting long contexts, they get slower, more expensive, and more likely to drown in irrelevant history.[2][3]
KV-cache hit rate beats most operational metrics because it explains both cost and latency at the point where inference work is created. Token counts tell you volume. Hit rate tells you reuse. In agents with long, repeated prefixes, reuse is where the real leverage lives.[1][2]
The strongest evidence comes from systems papers, not Twitter takes. TVCACHE reports that caching repeated tool trajectories correctly cut median tool-call execution time by up to 6.9x, with cache hit rates of up to 70%.[1] A separate paper on persistent KV cache for multi-agent inference reports time-to-first-token dropping from tens or even hundreds of seconds to sub-second or low-second reload times, depending on context size and cache state.[2]
Here's what I notice in production teams: they celebrate shaving 10% off prompt length while quietly invalidating the prefix on every turn. That is backwards. A slightly longer prompt with a stable reusable prefix is often better than a shorter prompt that busts the cache every time.
| Metric | What it tells you | What it misses |
|---|---|---|
| Token count | How much context you send | Whether any of it gets reused |
| Latency | End-user wait time | Why the request was slow |
| Cost per run | Spend per workflow | Whether structure can reduce recompute |
| Success rate | Task outcome | Infrastructure waste |
| KV-cache hit rate | Reuse of expensive prefill work | Needs context from other metrics |
KV-cache hit rate rises when prefixes stay stable and falls when early prompt tokens change unnecessarily. The biggest killers are dynamic metadata, reordered tool definitions, non-deterministic serialization, and rewriting prior context instead of appending new context.[1][4]
The community source here is practical and useful: Bustamante's write-up calls out stable prefixes and append-only context as the biggest production optimization for agent systems.[4] That matches the papers. TVCACHE relies on longest-prefix matching over tool histories because exact prior state is what makes reuse safe.[1] The persistent agent-memory paper also depends on monotonic prompt extension and prefix matching to reuse cache across phases.[2]
A few patterns matter more than people think:
- Put long-lived instructions, policies, examples, and tool schemas first. Put volatile data later. If the first 2,000 tokens stay identical, that is gold.
- Do not rewrite previous turns if you can avoid it. Once you mutate earlier tokens, you invalidate everything after that point.[4]
- If your orchestration layer serializes tools or state differently on each run, you are manufacturing cache misses.
- Changing available tools mid-run may help reasoning, but it can wreck reuse. Static definitions with runtime constraints are often the better trade-off.[1][4]
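The serialization point can be sketched in a few lines. This is an illustrative approach, not a method taken from the cited sources: sort the tool list by name and serialize with sorted keys so that logically identical definitions always yield the same bytes. The tool dictionaries and the `canonical_tools` helper are hypothetical.

```python
import json
from typing import Any


def canonical_tools(tools: list[dict[str, Any]]) -> str:
    """Serialize tool definitions so equal content gives equal text."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    # sort_keys fixes key order; compact separators fix whitespace.
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


# Same tools, different list order and different key order.
run_a = [
    {"name": "grep", "description": "search files"},
    {"name": "bash", "description": "run a command"},
]
run_b = [
    {"description": "run a command", "name": "bash"},
    {"description": "search files", "name": "grep"},
]

print(canonical_tools(run_a) == canonical_tools(run_b))
```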
You improve KV-cache hit rate by treating prompt structure like infrastructure, not prose. The right move is to design for prefix stability, deterministic state, and selective context growth. This is an engineering problem first and a prompt-writing problem second.[1][2][4]
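One way to enforce prefix stability in code is to make the context object append-only by construction. The class below is a minimal sketch with hypothetical names; it assumes the rendered prompt is plain newline-joined text and ignores context-length limits.

```python
class AppendOnlyContext:
    """Prompt context that can grow but never rewrites earlier text."""

    def __init__(self, stable_prefix: str) -> None:
        # Long-lived instructions and tool schemas go here, once.
        self._parts: list[str] = [stable_prefix]

    def append(self, text: str) -> None:
        """Add new material at the end; there is no edit or delete."""
        self._parts.append(text)

    def render(self) -> str:
        return "\n".join(self._parts)


ctx = AppendOnlyContext("System: You are a helpful coding agent.")
first = ctx.render()

ctx.append("User: run the tests")
second = ctx.render()

ctx.append("Tool result: 3 passed, 1 failed")
third = ctx.render()

# Every render extends the previous one, so the old prefix stays valid.
print(second.startswith(first), third.startswith(second))
```

The trade-off is growth: because nothing is ever rewritten, the context only gets longer, so this pattern needs a separate policy for when to start a fresh prefix.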
Here is the workflow I'd use:
A quick before-and-after makes this concrete:
Before
System: You are a helpful coding agent.
Timestamp: 2026-04-17T14:03:11.482Z
Available tools: grep, bash, read_file
Current task summary: ...
Conversation summary: ...
After
System: You are a helpful coding agent.
Available tools: bash, grep, read_file
Core operating rules: ...
Examples: ...
Conversation summary: ...
Current task summary: ...
Timestamp: 2026-04-17
That tiny change can be the difference between a warm prefix and a full recompute. If you want to clean up this kind of prompt structure quickly across apps, tools like Rephrase can help standardize and sharpen prompt text before it reaches your model layer. It will not fix orchestration bugs, but it does reduce the mess humans introduce.
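The before-and-after layouts can be compared directly. The sketch below rebuilds simplified versions of both prompts for two consecutive turns and measures the shared leading characters, which is only a proxy for token-level reuse. The builder functions and the task strings are illustrative assumptions.

```python
import os

STABLE = (
    "System: You are a helpful coding agent.\n"
    "Available tools: bash, grep, read_file\n"
)


def build_before(timestamp: str, task: str) -> str:
    """Volatile timestamp near the top, as in the 'Before' layout."""
    return (
        "System: You are a helpful coding agent.\n"
        f"Timestamp: {timestamp}\n"
        "Available tools: grep, bash, read_file\n"
        f"Current task summary: {task}\n"
    )


def build_after(date: str, task: str) -> str:
    """Stable block first, volatile fields last, as in 'After'."""
    return f"{STABLE}Current task summary: {task}\nTimestamp: {date}\n"


def shared_chars(a: str, b: str) -> int:
    # os.path.commonprefix compares strings character by character.
    return len(os.path.commonprefix([a, b]))


before_1 = build_before("2026-04-17T14:03:11.482Z", "fix failing test")
before_2 = build_before("2026-04-17T14:03:42.907Z", "fix failing test")
after_1 = build_after("2026-04-17", "fix failing test")
after_2 = build_after("2026-04-17", "fix failing test, then lint")

print("before:", shared_chars(before_1, before_2))
print("after: ", shared_chars(after_1, after_2))
```

In the first layout the two turns diverge inside the timestamp, so the tool list is never part of the shared prefix. In the second, the whole stable block is shared even though the task changed.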
A good KV-cache strategy combines high prefix reuse with safe state management. The best systems do not just cache blindly. They reuse only when prior state is compatible, and they design prompts so compatible state happens often.[1][2][3]
This is where the nuance matters. Not every cache hit is safe. The paper on agent caching failures is a useful warning: caching works only when keys are consistent and hits are precise. Bad cache design can produce unsafe reuse, especially in agentic workflows with stateful tools.[3]
So my rule is simple: maximize reuse, but never fake equivalence.
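A minimal sketch of that rule for tool-result caching, assuming the caller can supply some version identifier for the state a tool depends on. This is not the design from the cited papers; `ToolCache`, `cache_key`, and the `state_version` parameter are hypothetical. The key covers the tool name, canonicalized arguments, and the state version, so a hit only occurs when all three match.

```python
import hashlib
import json
from typing import Any, Optional


def cache_key(tool: str, args: dict[str, Any], state_version: str) -> str:
    """Hash of everything that must match for reuse to be safe."""
    payload = json.dumps(
        {"tool": tool, "args": args, "state": state_version},
        sort_keys=True,  # consistent keys regardless of argument order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToolCache:
    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get(self, tool: str, args: dict[str, Any], state_version: str) -> Optional[str]:
        return self._store.get(cache_key(tool, args, state_version))

    def put(self, tool: str, args: dict[str, Any], state_version: str, result: str) -> None:
        self._store[cache_key(tool, args, state_version)] = result


cache = ToolCache()
cache.put("read_file", {"path": "app.py", "mode": "r"}, "rev-1", "print('hi')")

# Same call, same state, different argument order: safe hit.
hit = cache.get("read_file", {"mode": "r", "path": "app.py"}, "rev-1")

# Same call after the underlying state changed: deliberate miss.
miss = cache.get("read_file", {"path": "app.py", "mode": "r"}, "rev-2")

print(hit, miss)
```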
| Strategy | Benefit | Risk |
|---|---|---|
| Stable shared prefix | High hit rate, lower TTFT | Requires strict formatting discipline |
| Append-only memory | Better reuse across turns | Can grow context too fast |
| Persistent KV cache | Huge TTFT reduction after eviction/reload | More system complexity |
| Semantic or intent caching | Can skip full LLM work entirely | Unsafe if keys are inconsistent |
The interesting part is that these techniques stack. A stable prefix improves model-side KV reuse. Intent or response caching can skip whole requests. Persistent KV storage helps when memory pressure forces eviction. Different layers. Same principle: don't recompute what you already know.
For more articles on prompt and agent infrastructure, the Rephrase blog is worth browsing if you care about turning messy prompts into repeatable systems.
Teams should use KV-cache hit rate as a diagnostic metric tied to routes, agents, and prompt templates, not as a vanity number. It becomes useful when you compare it with TTFT, token growth, and workflow shape. The point is to find broken structure fast.[1][2]
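A hedged sketch of what tying the metric to prompt templates could look like: group per-request token counts by template and flag the ones with low reuse. The record shape, template names, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Iterable

# Each record: (template name, prompt tokens, cached prompt tokens).
Record = tuple[str, int, int]


def hit_rate_by_template(records: Iterable[Record]) -> dict[str, float]:
    """Token-weighted hit rate for each prompt template."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for template, prompt_tokens, cached_tokens in records:
        totals[template][0] += cached_tokens
        totals[template][1] += prompt_tokens
    return {
        template: (cached / prompt if prompt else 0.0)
        for template, (cached, prompt) in totals.items()
    }


records = [
    ("planner_v3", 4000, 3600),
    ("planner_v3", 4400, 4000),
    ("summarizer_v1", 3000, 0),
    ("summarizer_v1", 3100, 100),
]
rates = hit_rate_by_template(records)

# The 0.5 cut-off is arbitrary; pick one that fits your workload.
suspects = sorted(name for name, rate in rates.items() if rate < 0.5)
print(rates, suspects)
```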
I would put it on the same dashboard as:
If a release hurts hit rate, assume something structural changed. Maybe a timestamp moved to the top. Maybe a JSON key order changed. Maybe a summarizer rewrote the running prefix. Those are fixable. That is the good news.
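When hit rate drops after a release, one quick diagnostic is to render the same request with the old and new templates and find the first line where they diverge. The helper below is a simple sketch under that assumption; the function name and the sample prompts are hypothetical.

```python
from typing import Optional


def first_divergence(old: str, new: str) -> Optional[tuple[int, Optional[str], Optional[str]]]:
    """Return (line index, old line, new line) at the first difference."""
    old_lines, new_lines = old.splitlines(), new.splitlines()
    for i, (a, b) in enumerate(zip(old_lines, new_lines)):
        if a != b:
            return i, a, b
    if len(old_lines) != len(new_lines):
        # One prompt is a strict prefix of the other.
        i = min(len(old_lines), len(new_lines))
        old_line = old_lines[i] if i < len(old_lines) else None
        new_line = new_lines[i] if i < len(new_lines) else None
        return i, old_line, new_line
    return None


old_prompt = "System: rules\nTools: bash, grep\nTask: fix test"
new_prompt = "Timestamp: 2026-04-17\nSystem: rules\nTools: bash, grep\nTask: fix test"

print(first_divergence(old_prompt, new_prompt))
```

A divergence at line 0 means the new release shares nothing with the old prefix. A divergence near the end is usually harmless, since only the tail needs recomputing.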
And yes, this is exactly the kind of thing that gets ignored because it does not sound product-facing. But it is. Fast agents feel smarter. Cheap agents scale further. Reused context is one of the cleanest ways to get both.
If you are constantly rewriting prompts by hand, Rephrase is a nice companion for the human side of the workflow. Just remember: better wording helps, but better cacheability compounds.
Documentation & Research
Community Examples
5. The LLM Context Tax: Best Tips for Tax Avoidance - Hacker News (LLM) / Nicolas Bustamante
KV-cache hit rate measures how often an LLM can reuse previously computed attention state instead of recomputing the prompt from scratch. Higher hit rates usually mean lower latency and lower cost.
The biggest lever is keeping the prompt prefix stable across requests. That means append-only context, deterministic serialization, and moving volatile fields like timestamps to the end.