Discover why cost per completed task is a better AI metric than cost per token in 2026, with research-backed examples and practical guidance.
Most AI teams still shop models like they're buying electricity: compare unit price, pick the cheapest, and assume the bill will work out. In May 2026, that mindset is wrong often enough to hurt.
Cost per completed task is the average amount you spend to get a real success, not just to generate tokens. It rolls price, token usage, retries, and success rate into one number, which is why it maps better to product reality than raw token pricing.
Here's the plain version I use:
Cost per completed task = total spend / number of successfully completed tasks
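As a minimal sketch, the formula fits in a few lines of Python. The function name and the sample figures are illustrative assumptions, not taken from any provider or from the cited papers.

```python
def cost_per_completed_task(total_spend: float, completed_tasks: int) -> float:
    """Average spend per successful task. Failed runs still count toward spend."""
    if completed_tasks <= 0:
        # No successes: treat the metric as infinite rather than dividing by zero.
        return float("inf")
    return total_spend / completed_tasks


# Hypothetical figures: $42.00 billed in total, 50 tasks attempted, 35 passed.
print(round(cost_per_completed_task(42.00, 35), 2))  # 1.2
```

Note that the denominator is the 35 tasks that passed, not the 50 that were attempted. Dividing by attempts would give cost per task run, which hides the failures.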
That sounds obvious. The catch is that most teams don't optimize for it. They optimize for cost per 1M input tokens, or cost per request, or a provider's advertised "cheap" tier. Those are procurement metrics. They are not outcome metrics.
If model A costs half as much per token as model B, but needs more retries, uses more hidden reasoning tokens, or fails more often, model A can easily be more expensive per useful result. That is not theory anymore. It's showing up clearly in 2026 research [1][2].
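Here is a rough worked example of that reversal. It assumes every attempt is independent and that you retry until one succeeds, so the expected number of attempts is 1 divided by the success rate. The `ModelProfile` class and all of its numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    price_per_million_tokens: float  # blended USD price (hypothetical)
    tokens_per_attempt: int          # includes hidden reasoning tokens
    success_rate: float              # probability that one attempt succeeds


def expected_cost_per_success(m: ModelProfile) -> float:
    """Expected spend until the first success, assuming independent retries."""
    cost_per_attempt = m.price_per_million_tokens * m.tokens_per_attempt / 1_000_000
    # With success probability p, the expected number of attempts is 1 / p.
    return cost_per_attempt / m.success_rate


# Model A is half the price per token, but burns more tokens and fails more often.
model_a = ModelProfile(price_per_million_tokens=1.0, tokens_per_attempt=60_000, success_rate=0.5)
model_b = ModelProfile(price_per_million_tokens=2.0, tokens_per_attempt=20_000, success_rate=0.8)

print(f"A: ${expected_cost_per_success(model_a):.3f}")  # A: $0.120
print(f"B: ${expected_cost_per_success(model_b):.3f}")  # B: $0.050
```

Under these made-up numbers, the "cheap" model costs more than twice as much per useful result.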
Cost per token is incomplete because it ignores token efficiency, success probability, and the shape of real workloads. It tells you the price of computation units, but not how many units a model will burn or whether those units create business value.
The clearest evidence comes from the "price reversal" paper. Across eight frontier reasoning models and nine task sets, researchers found that in 21.8% of model-pair comparisons, the model with the lower listed API price actually had the higher real cost [1]. In the worst cases, the reversal reached 28x [1].
That is a huge deal. It means the sticker price on the model page is not a reliable proxy for what you will actually spend.
Why does this happen? Mostly because models vary dramatically in thinking token usage. In one example from the paper, Gemini 3 Flash looked much cheaper on paper than GPT-5.2, but consumed far more thinking tokens on the same problem and ended up costing more overall [1].
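To see how thinking tokens can flip the ranking, here is a small sketch of per-request billing. It assumes thinking tokens are billed at the output rate, which you should confirm against your provider's pricing page. The prices and token counts are invented for illustration and are not the real figures for any model named above.

```python
def request_cost(input_tokens: int, visible_output_tokens: int, thinking_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """USD cost of one request. Assumes thinking tokens bill at the output rate."""
    billed_output = visible_output_tokens + thinking_tokens
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


# Same problem, same prompt, same visible answer length (hypothetical numbers).
# The "cheap" model lists at $0.50 / $3.00 per 1M tokens but thinks for 40k tokens.
cheap = request_cost(2_000, 500, 40_000, input_price_per_m=0.50, output_price_per_m=3.00)
# The "pricey" model lists at $1.75 / $14.00 per 1M tokens but thinks for 3k tokens.
pricey = request_cost(2_000, 500, 3_000, input_price_per_m=1.75, output_price_per_m=14.00)

print(f"cheap list price:  ${cheap:.4f}")   # cheap list price:  $0.1225
print(f"pricey list price: ${pricey:.4f}")  # pricey list price: $0.0525
```

The listed price never changed. The only thing that moved was how many tokens each model spent getting to the answer.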
So when someone says, "This model is cheaper," my first question now is: cheaper at what? Per token? Per request? Or per solved task?
Agent workflows amplify the gap between token price and real cost because they repeatedly read context, call tools, and loop through long trajectories. In these systems, input accumulation and retries can dominate billing even when output tokens look modest.
A 2026 study on agentic coding tasks found that agent workflows consume vastly more tokens than ordinary chat or one-shot reasoning. On average, agentic coding used 1,000x more tokens than code reasoning, and that usage was driven primarily by input tokens rather than output tokens [2].
That same paper found three details that matter a lot for budgeting:
First, token usage is highly variable. The same task can cost radically different amounts across runs, and some repeated runs differed by up to 30x in total tokens [2].
Second, more tokens did not reliably lead to better success. Accuracy often peaked at intermediate cost and then flattened or degraded at higher spend levels [2].
Third, expensive runs often came from repeated file reads, repeated edits, and other redundant behavior, not from smarter reasoning [2].
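If you want to check for these patterns in your own agent, a quick sketch like the one below is enough to start. It takes repeated runs of a single task and reports the run-to-run spread and the share of tokens spent on runs that failed. The run data and the `summarize_runs` helper are hypothetical.

```python
from statistics import median


def summarize_runs(runs: list[tuple[int, bool]]) -> dict[str, float]:
    """Summarize repeated runs of one task, given (total_tokens, succeeded) pairs."""
    tokens = [t for t, _ in runs]
    failed_tokens = sum(t for t, ok in runs if not ok)
    return {
        "spread": max(tokens) / min(tokens),          # most vs. least expensive run
        "median_tokens": median(tokens),
        "failed_share": failed_tokens / sum(tokens),  # tokens burned on failures
    }


# Hypothetical: five runs of the same task with the same model and scaffold.
runs = [
    (120_000, True),
    (95_000, True),
    (1_400_000, False),
    (310_000, True),
    (2_850_000, False),
]

summary = summarize_runs(runs)
print(f"spread: {summary['spread']:.1f}x")               # spread: 30.0x
print(f"failed share: {summary['failed_share']:.0%}")    # failed share: 89%
```

In this invented sample, the two longest runs are also the two failures, which is the pattern worth looking for in real traces.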
That's why cost per completed task is the better lens. It captures the ugly truth: a long, expensive agent trace that fails is not "more work." It's just more waste.
Teams should compare models on actual workload outcomes, using the same tasks, same success criteria, and the same billing rules. The goal is not to find the cheapest model on paper. It is to find the model with the best reliability-adjusted cost.
Here's the table I'd use internally:
| Metric | What it tells you | Why it matters |
|---|---|---|
| Cost per token | Unit price of inference | Good for rough budgeting, weak for model choice |
| Cost per request | Average price of one call | Useful for simple apps, weak for agents |
| Success rate | Share of tasks truly completed | Critical, but incomplete without cost |
| Cost per completed task | Spend required for one successful outcome | Best single metric for product decisions |
| Cost of failure | Spend lost on unsuccessful runs | Essential for agent and tool workflows |
What I noticed is that most teams already have the logs needed for this. They just don't roll them up the right way. If you can measure request cost and task success, you can measure the metric that matters.
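As a sketch of that roll-up, the snippet below turns per-task log records into the metrics from the table. The `TaskLog` fields are assumptions about what your logging captures, and the sample records are made up.

```python
from dataclasses import dataclass


@dataclass
class TaskLog:
    task_id: str
    requests: int     # API calls made for this task, including retries
    cost_usd: float   # billed cost across all of those calls
    succeeded: bool   # did the task meet its success criteria?


def roll_up(logs: list[TaskLog]) -> dict[str, float]:
    """Aggregate per-task logs into outcome-level cost metrics."""
    if not logs:
        raise ValueError("need at least one task log")
    total_spend = sum(log.cost_usd for log in logs)
    total_requests = sum(log.requests for log in logs)
    successes = sum(1 for log in logs if log.succeeded)
    failed_spend = sum(log.cost_usd for log in logs if not log.succeeded)
    return {
        "cost_per_request": total_spend / total_requests,
        "success_rate": successes / len(logs),
        "cost_per_completed_task": total_spend / successes if successes else float("inf"),
        "cost_of_failure": failed_spend,
    }


logs = [
    TaskLog("t1", requests=4, cost_usd=0.40, succeeded=True),
    TaskLog("t2", requests=12, cost_usd=1.50, succeeded=False),
    TaskLog("t3", requests=3, cost_usd=0.30, succeeded=True),
    TaskLog("t4", requests=6, cost_usd=0.80, succeeded=True),
]

for name, value in roll_up(logs).items():
    print(f"{name}: {value:.2f}")
```

With these sample records, cost per request comes out to $0.12 while cost per completed task is $1.00, and half of the total spend went to the one task that failed.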
A related paper on agent benchmarking makes the same broader point from another angle: evaluation itself is expensive, and per-task costs vary wildly, with SWE-bench Verified runs ranging from $0.08 to $32 per task depending on model and scaffold [3]. That spread alone should kill the idea that "token price" is enough.
You improve cost per completed task by reducing wasted context, tightening prompts, choosing models by workload, and measuring failures aggressively. The best gains usually come from workflow design, not just model switching.
Here's a simple before-and-after prompt example:
Before:
Fix this bug in my repo.
After:
You are debugging a Python repository.
Goal: identify the root cause of the failing authentication flow and propose the smallest safe fix.
Constraints: do not refactor unrelated modules, list assumptions, and stop after proposing one primary fix plus one fallback.
Success criteria: tests for auth pass, no changes outside auth-related files unless required.
The second prompt does three useful things. It narrows scope. It defines success. And it reduces wandering. That matters because wandering is expensive.
This is also where prompt tooling earns its keep. If your team writes prompts all day in Slack, your IDE, or docs, Rephrase can turn rough instructions into tighter, skill-specific prompts in a couple of seconds. It won't solve model economics by itself, but it can cut a surprising amount of ambiguity before the tokens start flowing.
For deeper reading on practical prompting workflows, I'd also point people to the Rephrase blog, because the fastest way to lower cost per completed task is often to ask better in the first place.
You should report cost per completed task alongside success rate and total spend, because leaders care about delivered outcomes, not token trivia. This framing makes AI costs legible to product, finance, and operations teams at the same time.
If you tell a VP, "Model X is $3 cheaper per million tokens," that sounds technical but not meaningful. If you tell them, "Model X delivers one successful ticket resolution for $0.84 versus $1.37 on Model Y," that is instantly clear.
There's also a strategic benefit. Once you report outcome-based metrics, prompt quality, tool design, retry policy, and model selection all become part of the same optimization loop. That's the right loop.
The Reddit chatter around newer usage-based pricing shifts is noisy, but one thing it gets right is this: the flat-rate fantasy is ending, and heavy agent use is being repriced closer to real consumption. Teams without outcome-level visibility are going to get surprised [4].
The short version is simple. Cost per token is a component. Cost per completed task is the business metric. In 2026, if you're still choosing models by sticker price alone, you're probably optimizing the wrong line on the spreadsheet.
If you want a practical next step, take one production workflow, run the same 50 tasks across two or three model setups, and compute the real number: dollars spent per successful completion. Once you see that number, it's hard to go back.
Documentation & Research
Community Examples
5. Copilot just 9x'd Sonnet and 27x'd Opus and teams have no idea - r/ChatGPT
Cost per completed task measures how much you spend, on average, to get a task successfully finished. It combines both price and success rate, which makes it more useful than looking at token pricing alone.
A practical formula is total spend divided by successful completions. If you want to compare models fairly, run the same task set, track actual billed cost, and divide by the number of tasks that truly pass.