Discover why per-tenant cost tracking fails at the protocol layer for autonomous agents, and learn practical attribution patterns you can use today.
Most teams want a simple answer: attach a tenant ID to every request and bill it out later. With autonomous agents, that breaks down fast. The agent is not a single request. It is a chain of model calls, tool calls, retries, memory fetches, and sometimes silent background work.
Key Takeaways
Per-tenant tracking sounds easy until one user request causes five model calls and three tool invocations. A protocol can label the first message, but it cannot magically know which later token burst came from the same tenant, which came from shared system logic, or which came from a cached memory lookup. That's why attribution must be rebuilt above the protocol layer [1].
Autonomous agents are execution systems, not single-turn chats. The paper on agentic attribution frames agent behavior as a temporal sequence of observations, actions, memory, and tool outputs, and argues that understanding "why" requires tracing that sequence, not just the final response [1]. That matters for billing too.
Protocols are good at transport. They are bad at semantics. They can carry headers, IDs, and metadata, but they do not know whether a token cost belongs to a user prompt, a planner step, a tool retry, or a delegated sub-agent. In other words, a protocol can preserve labels, but it cannot infer ownership when responsibility becomes distributed.
This is exactly the gap the literature keeps circling. ACAR's authors found that practical attribution requires explicit counterfactual computation, and that proxy signals like response similarity or entropy do not correlate well with ground truth [2]. If attribution itself is hard, cost attribution is not going to be solved by a thinner wire format.
Agents expand one request into many internal events. They may revisit memory, call tools, ask clarification questions, or branch into multiple candidate plans. Each of those steps can run on a different model and consume tokens at a different rate. So the real billing unit is not the message; it is the trace.
Here's the catch: shared infrastructure makes this worse. A tenant might trigger a global retrieval cache, a shared tool backend, or a platform-level safety check. If you only look at protocol-visible traffic, you miss the hidden work. That's why the right abstraction is execution tracing with tenant propagation, not protocol-level attribution alone [1][2].
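To make "tenant propagation" concrete, here is a minimal Python sketch that uses the standard library's `contextvars` so nested agent steps inherit the tenant without passing it through every function. The names `current_tenant`, `record`, `handle_request`, and `platform_safety_sweep`, plus the token counts, are hypothetical and exist only for illustration.

```python
import contextvars
from dataclasses import dataclass
from typing import Optional

# Hypothetical context variable holding the tenant for the current execution.
current_tenant = contextvars.ContextVar("current_tenant", default=None)

@dataclass
class Event:
    tenant_id: Optional[str]  # None means no tenant was in scope
    kind: str
    tokens: int

EVENTS = []  # in-memory stand-in for a real trace store

def record(kind, tokens):
    # Every event picks up whatever tenant is active in the current context.
    EVENTS.append(Event(current_tenant.get(), kind, tokens))

def retrieve_memory():
    record("memory_fetch", 120)

def call_tool():
    record("tool_call", 0)
    record("model_call", 450)  # e.g. a model call that summarizes the tool output

def handle_request(tenant_id):
    token = current_tenant.set(tenant_id)
    try:
        record("model_call", 300)
        retrieve_memory()
        call_tool()
    finally:
        # Restore the previous value so the tenant does not leak into later work.
        current_tenant.reset(token)

def platform_safety_sweep():
    # Runs outside any request, so it is recorded without a tenant.
    record("safety_check", 80)

handle_request("acme")
platform_safety_sweep()
for event in EVENTS:
    print(event)
```

The design choice worth noting is the `reset` in the `finally` block: hidden work that runs outside a request shows up with no tenant instead of silently inheriting the last one.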
The most defensible pattern is layered accounting. First, propagate tenant identity through the orchestrator. Second, stamp every tool call, model call, and retry with a trace ID. Third, aggregate token usage at the step level and roll it up into tenant-level cost. That sounds boring, but boring is what scales.
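The three layers above can be sketched in a few lines of Python. This is a rough illustration, not a reference implementation: the `Trace` and `Step` classes, the model names, and the per-1K-token prices are all assumptions made up for the example.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Hypothetical prices per 1,000 tokens; real prices depend on your provider.
PRICE_PER_1K = {"small-model": 0.5, "large-model": 5.0}

@dataclass
class Step:
    trace_id: str
    tenant_id: str
    kind: str               # "model_call", "tool_call", "retry", ...
    model: Optional[str]    # None for steps that do not call a model
    tokens: int

class Trace:
    """One user request and every step it fans out into."""

    def __init__(self, tenant_id):
        self.tenant_id = tenant_id          # layer 1: tenant identity
        self.trace_id = uuid.uuid4().hex    # layer 2: shared trace ID
        self.steps = []

    def stamp(self, kind, model=None, tokens=0):
        step = Step(self.trace_id, self.tenant_id, kind, model, tokens)
        self.steps.append(step)
        return step

def step_cost(step):
    # Steps without a model (plain tool calls here) carry no token cost.
    if step.model is None:
        return 0.0
    return step.tokens / 1000 * PRICE_PER_1K[step.model]

def rollup(traces):
    # Layer 3: aggregate step-level cost into tenant-level cost.
    totals = defaultdict(float)
    for trace in traces:
        for step in trace.steps:
            totals[trace.tenant_id] += step_cost(step)
    return dict(totals)

trace = Trace("acme")
trace.stamp("model_call", model="large-model", tokens=1000)
trace.stamp("tool_call")
trace.stamp("retry", model="small-model", tokens=2000)
totals = rollup([trace])
print(totals)
```

Because every step carries both the tenant and the trace ID, the same records can be rolled up per tenant for billing or per trace for debugging.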
The research supports this direction. Agentic attribution work uses hierarchical tracing to localize which historical components drove behavior, then narrows to sentence-level evidence [1]. ACAR similarly argues that auditable decision traces are necessary for meaningful measurement and that attribution proxies are too weak to trust on their own [2]. The lesson is simple: if you want fair billing, make the trace explicit.
I like to think about agent billing in three buckets. The first is direct user work: the visible prompt and its immediate response. The second is agent work: planning, retries, memory, and tool execution. The third is platform overhead: safety filters, routing, cache misses, and shared services.
That gives you a clean operating table:
| Cost bucket | Example | Who should see it | Why it matters |
|---|---|---|---|
| Direct user work | Initial prompt and answer | Tenant | Easy to attribute |
| Agent work | Tool retries, planning loops | Tenant/session | Hidden compute dominates |
| Platform overhead | Safety checks, routing, cache misses | Platform or pooled | Not tenant-specific by default |
This is where teams often overfit. They try to force every hidden cost into a tenant bucket. That creates bad incentives and ugly disputes. Sometimes the honest answer is "this was shared platform overhead," not "some tenant caused it."
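One way to keep those buckets honest is to make the mapping explicit in code. The sketch below is only an illustration: the step kind names and the `KIND_TO_BUCKET` table are assumptions derived from the table above, and sending unknown kinds to the pooled bucket is one possible policy, not the only reasonable one.

```python
from collections import Counter
from enum import Enum

class Bucket(Enum):
    DIRECT = "direct_user_work"
    AGENT = "agent_work"
    PLATFORM = "platform_overhead"

# Hypothetical step kinds, mapped to the three buckets from the table.
KIND_TO_BUCKET = {
    "initial_prompt": Bucket.DIRECT,
    "final_answer": Bucket.DIRECT,
    "planning": Bucket.AGENT,
    "retry": Bucket.AGENT,
    "memory_fetch": Bucket.AGENT,
    "tool_call": Bucket.AGENT,
    "safety_check": Bucket.PLATFORM,
    "routing": Bucket.PLATFORM,
    "cache_miss": Bucket.PLATFORM,
}

def classify(kind):
    # Assumption: anything unrecognized is pooled rather than billed to a tenant.
    return KIND_TO_BUCKET.get(kind, Bucket.PLATFORM)

def split_tokens(steps):
    """Sum token usage per bucket for an iterable of (kind, tokens) pairs."""
    totals = Counter()
    for kind, tokens in steps:
        totals[classify(kind)] += tokens
    return totals

steps = [
    ("initial_prompt", 200),
    ("planning", 300),
    ("retry", 500),
    ("safety_check", 50),
    ("mystery_step", 10),  # unknown kind, falls into platform overhead
]
totals = split_tokens(steps)
for bucket, tokens in totals.items():
    print(bucket.value, tokens)
```

Keeping the table in one place means a dispute about a charge becomes a question about one mapping entry rather than an argument about the whole invoice.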
When I review agent specs, I often see vague instructions like this:
Build an agent that helps customers and keep costs low.
That prompt is too mushy to support attribution. A better version is:
Build a customer-support agent that:
1. tags every turn with tenant_id and session_id
2. logs every model call, tool call, retry, and cache miss
3. records token usage per step
4. rolls step costs into tenant billing at the end of each session
5. separates shared platform overhead from tenant-specific execution
That second version is attribution-friendly because it defines what should be measured. If you want sharper operational prompts like this faster, tools like Rephrase can rewrite rough specs into more precise implementation prompts in a couple of seconds.
If I were shipping this tomorrow, I would not start with protocol changes. I would start with traces, spans, and clear ownership rules. Protocols can standardize transport, but only your app knows which work was tenant-specific and which work was shared. That's the boundary that matters.
I'd also treat attribution as an audit product, not a billing afterthought. The point is not just to invoice correctly. It is to answer hard questions when costs spike: which tenant drove the spend, which agent loop exploded, and which tool chain was responsible. That's the kind of observability that earns trust.
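Those three audit questions map onto simple group-by queries over step records. The following Python sketch assumes a flat list of `(tenant_id, trace_id, tool, cost)` tuples; the data, field order, and function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical step records: (tenant_id, trace_id, tool, cost)
STEPS = [
    ("acme", "t1", "search", 0.02),
    ("acme", "t1", "search", 0.02),
    ("acme", "t1", "summarize", 0.10),
    ("globex", "t2", "search", 0.02),
    ("acme", "t3", "summarize", 0.10),
]

def spend_by(steps, key_index):
    """Total cost grouped by the tuple field at key_index."""
    totals = defaultdict(float)
    for step in steps:
        totals[step[key_index]] += step[3]
    return dict(totals)

def steps_per_trace(steps):
    """Number of recorded steps in each trace, a rough signal for runaway loops."""
    counts = defaultdict(int)
    for _, trace_id, _, _ in steps:
        counts[trace_id] += 1
    return dict(counts)

def top(totals):
    # Key with the largest value.
    return max(totals, key=totals.get)

top_tenant = top(spend_by(STEPS, 0))     # which tenant drove the spend
top_tool = top(spend_by(STEPS, 2))       # which tool was responsible
longest_trace = top(steps_per_trace(STEPS))  # which agent loop ran longest
print(top_tenant, top_tool, longest_trace)
```

In practice these queries would run against a trace store rather than a Python list, but the shape of the question stays the same.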
If you want more practical breakdowns like this, the Rephrase blog is where I'd keep digging into prompt and agent workflows.
Documentation & Research
Community Examples
Per-tenant cost tracking fails at the protocol layer because protocols can carry metadata, but they cannot reliably explain shared context, tool calls, retries, or downstream model behavior. The attribution problem lives in the application and execution graph, not the wire format.
Agents reuse memory, branch into tools, and can trigger multiple model calls per user request. That means one visible request can fan out into many hidden compute events.
Log tenant identity, session identity, tool invocation IDs, prompt hashes, model names, retry counts, and token usage for each step. Without that, cost allocation gets fuzzy fast.
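As a rough sketch of what one such per-step record could look like, the Python below defines a log entry with those fields and serializes it to a JSON line. The `StepLog` class, its field names, and the choice of SHA-256 for the prompt hash are assumptions for illustration; splitting token usage into input and output counts is also an assumption, not something the list above prescribes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class StepLog:
    tenant_id: str
    session_id: str
    tool_invocation_id: Optional[str]  # None when the step is not a tool call
    prompt_hash: str
    model: Optional[str]               # None when no model was called
    retry_count: int
    input_tokens: int
    output_tokens: int

def hash_prompt(prompt):
    # Hashing lets you group identical prompts without storing their text.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def to_log_line(entry):
    # One JSON object per line; sorted keys keep the output stable for diffing.
    return json.dumps(asdict(entry), sort_keys=True)

entry = StepLog(
    tenant_id="acme",
    session_id="sess-001",
    tool_invocation_id=None,
    prompt_hash=hash_prompt("Summarize the open tickets."),
    model="large-model",  # hypothetical model name
    retry_count=0,
    input_tokens=412,
    output_tokens=96,
)
line = to_log_line(entry)
print(line)
```

Freezing the dataclass is a small design choice that keeps a log entry from being mutated after it has been recorded.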