Learn how to match the right AI model to each task, avoid hidden LLM costs, and build a routing strategy that slashes API spend. Try free.
Most teams don't overspend on AI because their prompts are bad. They overspend because they use one expensive model for everything.
That sounds harmless at prototype stage. In production, it gets brutal.
The wrong model gets expensive because API price per token is only one part of the bill. Real cost depends on how many tokens a model actually uses, how much hidden reasoning it performs, and whether that level of capability was necessary for the task in the first place [1].
Here's the trap I see all the time: teams compare pricing pages, notice one model is "cheap," and assume it will stay cheap in production. That assumption breaks fast. A recent paper on reasoning models found a "pricing reversal" in 21.8% of model-pair comparisons, meaning the model with the lower listed price ended up costing more in real workloads [1]. In the most extreme case, the reversal reached 28× [1].
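To see how that happens, here's a toy calculation with made-up prices and token counts (not the paper's figures): because thinking tokens are billed as output tokens, the model with the lower listed price can end up costing roughly twice as much per request.

```python
# Toy illustration of a pricing reversal. Prices and token counts are hypothetical,
# not figures from the paper. Thinking tokens are billed at the output rate.
def request_cost(price_per_m_output: float, output_tokens: int, thinking_tokens: int) -> float:
    return (output_tokens + thinking_tokens) * price_per_m_output / 1_000_000

cheap_listed = request_cost(price_per_m_output=0.40, output_tokens=500, thinking_tokens=11_000)
pricey_listed = request_cost(price_per_m_output=2.00, output_tokens=500, thinking_tokens=600)

print(f"cheap-listed model:  ${cheap_listed:.4f} per request")   # ~$0.0046
print(f"pricey-listed model: ${pricey_listed:.4f} per request")  # ~$0.0022
```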
That matters because many modern apps are full of tasks that don't need deep reasoning at all. If you route lightweight jobs like classification, extraction, rewriting, guardrails, or simple Q&A to a flagship reasoning model, you're paying premium rates for work a smaller model could handle just fine.
You should optimize for the actual tradeoff between quality, latency, and total cost on your workload. The winning model is not the cheapest on paper or the strongest in a benchmark. It is the one that gives acceptable output quality for that specific task at the lowest real end-to-end cost [2].
This is where the research gets useful. The AgentOpt technical report argues that model selection is the first-order optimization problem in agent systems, ahead of caching or scheduling, because it determines the whole cost structure upstream [2]. Across their benchmarks, the gap between the best and worst model combinations at comparable quality ranged from 13× to 32× [2].
That's the headline most teams miss. If your model choice is wrong, infra tuning won't save you.
A practical way to think about it is this:
| Task type | What usually matters most | Typical model choice |
|---|---|---|
| Classification, tagging, extraction | Cost, speed, consistency | Small or mini models |
| Rewrite, summarization, formatting | Speed and acceptable quality | Mid-tier models |
| Multi-step reasoning, planning, code changes | Quality and robustness | Strong reasoning models |
| Agent orchestration or tool-heavy flows | Workflow fit, not just raw IQ | Role-specific model mix |
What's interesting is that the "best model" can change by role. AgentOpt shows that a powerful model can be great as a solver but poor as a planner because it behaves in a way that breaks the workflow [2]. So the question is not "What's our best model?" It's "What's our best model for this exact subtask?"
Hidden reasoning tokens break cost estimates because they are billed like output tokens but are often invisible in casual comparisons. If one model thinks 10× longer than another on the same prompt, its lower token price may not matter at all [1].
The pricing-reversal paper makes this painfully clear. The authors found that thinking tokens dominate actual cost across many reasoning models, and that one model may use 900% more thinking tokens than another on the same query [1]. In one AIME example, GPT-5.2 used 562 thinking tokens while Gemini 3 Flash used more than 11,000 to reach the same answer, making the cheaper-listed model 2.5× more expensive on that task [1].
This is why finance teams hate LLM bills. The unit economics look neat in a spreadsheet, then reality shows up with stochastic internal reasoning and variable token burn.
My take: if you're using reasoning-heavy models, stop forecasting from pricing pages alone. Forecast from observed request traces.
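As a rough sketch of what that looks like, assuming you log token counts per request (the prices, model names, and field names below are placeholders, not any provider's real schema):

```python
from statistics import mean

# Hypothetical per-million-token prices; substitute your providers' actual rates.
PRICES = {
    "model-a": {"input": 1.25, "output": 10.00},
    "model-b": {"input": 0.30, "output": 2.50},
}

def observed_cost(trace: dict) -> float:
    """Cost of one logged request; thinking tokens are billed at the output rate."""
    p = PRICES[trace["model"]]
    billed_output = trace["output_tokens"] + trace.get("thinking_tokens", 0)
    return (trace["input_tokens"] * p["input"] + billed_output * p["output"]) / 1_000_000

def summarize(traces: list[dict]) -> None:
    """Observed (not listed) cost per request, per model, from real traffic."""
    for model in PRICES:
        costs = [observed_cost(t) for t in traces if t["model"] == model]
        if costs:
            print(f"{model}: mean ${mean(costs):.4f}/req, worst ${max(costs):.4f}/req, n={len(costs)}")
```

Run that over a week of production traces and the forecast reflects how the models actually think, not how their pricing pages read.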
You can match capability to task by splitting your workload into task buckets, evaluating each bucket separately, and routing requests to the cheapest model that clears your quality bar. This beats the lazy "one model for everything" setup almost every time [2][3].
Here's the simple workflow I recommend:
1. Split your traffic into task buckets by difficulty, latency tolerance, and quality requirements.
2. Build a small, representative eval set for each bucket.
3. Run those prompts across candidate models and record actual output quality, latency, and cost, not listed prices.
4. Route each bucket to the cheapest model that clears its quality bar, and reserve the expensive models for requests that genuinely need them.
That last part is where the savings usually appear. HotelQuEST, a benchmark on agentic search, found that only a minority of hard queries really needed the most powerful agents. Their oracle results showed near-optimal accuracy could be reached at a fraction of the cost, with a budget oracle at $1 outperforming all individual agents while costing 96× less than one expensive baseline [3].
That's not a small optimization. That's a product decision.
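The selection step itself is mechanical once you have the evals. Here's a minimal sketch, assuming you already have per-bucket scores; the model names, prices, and numbers are hypothetical.

```python
# Per-bucket model selection: cheapest candidate that clears the quality bar.
# Scores come from your own eval set; names and figures here are hypothetical.
CANDIDATES = [
    # (model, cost per 1K requests in USD)
    ("small-model", 0.4),
    ("mid-model", 2.0),
    ("flagship-model", 15.0),
]

def pick_model(bucket: str, eval_scores: dict[str, dict[str, float]], quality_bar: float) -> str:
    """Return the cheapest model whose eval score on this bucket meets the bar."""
    for model, _cost in sorted(CANDIDATES, key=lambda c: c[1]):
        if eval_scores[bucket].get(model, 0.0) >= quality_bar:
            return model
    return CANDIDATES[-1][0]  # nothing clears the bar: fall back to the strongest model

# Example: extraction tolerates a 0.90 bar, so the small model wins the bucket.
scores = {"extraction": {"small-model": 0.93, "mid-model": 0.96, "flagship-model": 0.98}}
print(pick_model("extraction", scores, quality_bar=0.90))  # -> "small-model"
```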
A good routing strategy replaces "everything goes to the smartest model" with a tiered path where simple requests stay cheap and hard ones escalate only when necessary. That is how teams get dramatic savings without wrecking UX [2][3].
Here's a stripped-down example.
System behavior before routing:
- Send every user request to a flagship reasoning model
- Same model handles extraction, summaries, support replies, planning, and edge cases
- No fallback or escalation logic
Routing policy that replaces it (sketched in code below):
- Extraction, classification, and formatting -> small model
- Summaries and routine support replies -> mid-tier model
- Complex code, planning, ambiguous requests, or failed first pass -> flagship model
- If confidence is low, escalate automatically
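Here's a minimal sketch of that policy, assuming you pass in your own task classifier and model-calling function; the tier names and confidence threshold are placeholders, not a prescription.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Result:
    text: str
    confidence: float
    failed: bool = False

# Hypothetical tier map; call_model would wrap your actual provider clients.
TIERS = {
    "extraction": "small-model",
    "classification": "small-model",
    "formatting": "small-model",
    "summary": "mid-model",
    "support_reply": "mid-model",
}
FLAGSHIP = "flagship-model"
CONFIDENCE_FLOOR = 0.7

def route(request: str,
          classify_task: Callable[[str], str],
          call_model: Callable[[str, str], Result]) -> Result:
    """Send the request to the cheapest suitable tier; escalate on low confidence or failure."""
    task = classify_task(request)
    model = TIERS.get(task, FLAGSHIP)           # unknown or complex tasks go straight to flagship
    result = call_model(model, request)

    if model != FLAGSHIP and (result.failed or result.confidence < CONFIDENCE_FLOOR):
        result = call_model(FLAGSHIP, request)  # escalate only when the cheap pass isn't good enough
    return result
```

The design choice that matters is the escalation guard: the flagship model is a fallback path, not the default path.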
And here's the impact pattern I keep noticing:
| Setup | Quality | Latency | Cost |
|---|---|---|---|
| One flagship model for all tasks | High | Medium to high | Very high |
| Tiered routing by task | High on hard tasks, good enough on easy ones | Usually better | Much lower |
| Role-based workflow mix | Often best overall | Varies | Lowest at target quality |
If you write prompts manually across apps, a tool like Rephrase can help standardize the input before it reaches your router, which makes your evals cleaner and your routing more reliable. That's especially helpful when the same intent starts in Slack, an IDE, or a browser.
Teams still overspend after adding routing because they route on intuition instead of evidence. They also ignore stopping behavior, repeated tool calls, and role mismatch inside multi-step workflows [2][3].
HotelQuEST found a recurring pattern in agent systems: over-exploration. Agents kept making tool calls after they already had enough evidence, which increased cost and latency without improving accuracy [3]. AgentOpt found something related from another angle: strong standalone models can be the wrong fit for a workflow role, so a smarter model in isolation may still be the costlier choice in context [2].
That's why cost control isn't just model routing. It's evaluation plus routing plus observability.
At minimum, track cost per request, cost per successful outcome, and which model handled each request class. If you can't answer "Which 10% of endpoints generate 80% of spend?" you're not really optimizing yet.
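A small sketch of that bookkeeping, assuming you log one record per request with an endpoint, model, cost, and success flag (the field names are illustrative):

```python
from collections import defaultdict

def spend_report(records: list[dict]) -> None:
    """Aggregate logged requests into cost per endpoint and cost per successful outcome.

    Each record is assumed to look like:
    {"endpoint": "/summarize", "model": "mid-model", "cost_usd": 0.0021, "success": True}
    """
    by_endpoint = defaultdict(float)
    successes = defaultdict(int)

    for r in records:
        by_endpoint[r["endpoint"]] += r["cost_usd"]
        successes[r["endpoint"]] += int(r["success"])

    total = sum(by_endpoint.values()) or 1.0
    for endpoint, cost in sorted(by_endpoint.items(), key=lambda kv: -kv[1]):
        print(f"{endpoint}: ${cost:.2f} ({100 * cost / total:.0f}% of spend), "
              f"${cost / max(successes[endpoint], 1):.4f} per successful outcome")
```

Sort that report by spend and the "which 10% of endpoints generate 80% of spend" question usually answers itself.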
For more practical workflows on prompt quality and AI usage patterns, browse the Rephrase blog. And if you want a faster way to clean up prompts before they hit your model stack, Rephrase for macOS is a nice shortcut.
The real cost of picking the wrong model isn't a few extra cents. It's building your whole product on the assumption that maximum capability should be the default.
It shouldn't.
Start with the cheapest model that can do the job. Prove where stronger models are necessary. Then route everything else down. That's how you cut costs by 80% without turning your app into a worse product.
Documentation & Research
Community Examples
4. I compared 8 AI coding models on the same real-world feature in an open-source TypeScript project. Here are the results - r/LocalLLaMA (link)
How do you choose the right model for a task?
Start by grouping tasks by difficulty, latency tolerance, and quality requirements. Then test a small representative set of prompts across multiple models and compare actual output quality, latency, and cost instead of relying on pricing pages alone.
What is model routing?
Model routing means sending each request to the model that best fits that task's needs. Simple classification, extraction, or rewrite tasks can go to smaller models, while harder reasoning tasks go to stronger ones.