Learn how reasoning_effort became the shared AI API UX across Mistral, Anthropic, and OpenAI, with examples for cost-aware prompts.
The most interesting AI API trend right now is not a bigger context window or another benchmark win. It is that Mistral, Anthropic, and OpenAI all landed on the same product idea: make reasoning a per-request setting.
Per-request reasoning_effort is an inference-time setting that controls how much reasoning a model should spend on a specific call. Instead of choosing a "fast model" for simple tasks and a "reasoning model" for hard tasks, developers can often keep one model endpoint and vary effort per request.
That sounds small. It is not. It changes the architecture of AI products.
The old pattern was model routing. You sent cheap tasks to a small model, hard tasks to a bigger model, and hoped your router knew the difference. The new pattern is behavior routing inside the same model family. The request itself says, in effect: answer fast, think a little, or think hard.
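To make the contrast concrete, here is a minimal sketch of both patterns side by side. The model names are hypothetical placeholders, and the `is_hard` flag stands in for whatever signal your own router uses.

```python
# Hypothetical model names; none of these is a real model identifier.
SMALL_MODEL = "small-fast-model"
LARGE_MODEL = "large-reasoning-model"
SINGLE_MODEL = "one-model"


def route_by_model(is_hard: bool) -> dict:
    """Old pattern: the router picks a different model per task."""
    return {"model": LARGE_MODEL if is_hard else SMALL_MODEL}


def route_by_effort(is_hard: bool) -> dict:
    """New pattern: same model, the request itself carries the effort level."""
    return {
        "model": SINGLE_MODEL,
        "reasoning_effort": "high" if is_hard else "low",
    }


print(route_by_model(is_hard=True))
print(route_by_effort(is_hard=False))
```

In the second function the model never changes. Only the request does.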
Researchers working on ARES describe this as a more granular alternative to multi-model routing. Their paper notes that many modern LLMs now support configurable levels such as low, medium, and high, and that agents should reserve high reasoning for difficult steps while using lower effort for straightforward actions [1].
Here is the rough product translation: reasoning is becoming a runtime budget, not a model identity.
The three providers converged because reasoning models created the same developer pain for everyone: better answers often cost more time and tokens. A per-request effort knob gives product teams a simple, legible way to trade quality against latency and cost without redesigning their model stack.
Mistral's move is the cleanest example. Reports on Mistral Small 4 describe a single model that combines instruction following, reasoning, multimodal input, and agentic coding. More importantly, it exposes a per-request reasoning_effort parameter, with none positioned for fast chat-like responses and high for deliberate reasoning [5].
Anthropic's pattern is similar, even if the naming differs. Claude Opus 4.6 coverage describes explicit effort controls with low, medium, high, and max levels, plus adaptive thinking so the model can decide when extended reasoning is needed [6].
OpenAI has been pushing the same mental model through its reasoning APIs: a reasoning.effort configuration that developers can pin per call. A recent LLM code-smell taxonomy even treats "reasoning effort not explicitly set" as a reliability and maintainability problem for reasoning-capable systems [3].
The UX convergence is obvious: expose reasoning depth as a first-class request parameter.
| Provider | Control surface | Developer mental model | Best use |
|---|---|---|---|
| Mistral | reasoning_effort | One model, vary depth per call | Fast chat through deep reasoning |
| Anthropic | thinking / effort controls | Trade speed, cost, and depth | Coding agents, research, long workflows |
| OpenAI | reasoning.effort | Explicit reasoning budget tier | Production calls with reproducible settings |
The names differ. The product idea is the same.
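One way to live with the naming differences is a thin translation layer in your API wrapper. The sketch below is an assumption-heavy illustration: the Mistral and OpenAI shapes follow the names used in this article, the Anthropic key is a placeholder because no exact field is named here, and `effort_params` is a hypothetical helper. Check each provider's current API reference before relying on any of these shapes.

```python
from typing import Any

# Normalized effort tiers used by this sketch (not a provider standard).
EFFORTS = ("low", "medium", "high")


def effort_params(provider: str, effort: str) -> dict[str, Any]:
    """Translate one normalized effort level into a provider-shaped fragment."""
    if effort not in EFFORTS:
        raise ValueError(f"unknown effort: {effort!r}")
    if provider == "mistral":
        return {"reasoning_effort": effort}
    if provider == "openai":
        return {"reasoning": {"effort": effort}}
    if provider == "anthropic":
        # Placeholder key: the article only says "thinking / effort controls",
        # so this field name is illustrative, not a documented parameter.
        return {"effort": effort}
    raise ValueError(f"unknown provider: {provider!r}")


print(effort_params("openai", "medium"))
```

The point of the wrapper is that the rest of your code only ever talks about low, medium, and high.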
Research supports the idea that effort control matters, especially for agents, but it also warns against treating effort as a perfect intelligence dial. Reasoning levels can reduce token cost when routed well, yet higher effort does not always mean better output or deeper real-time cognition.
The ARES paper is the most practical source here. It trains a lightweight router to choose low, medium, or high reasoning effort per agent step. Across tool-use, deep-research, and web-navigation tasks, ARES reduced reasoning token usage by up to 52.7% compared with fixed high effort while preserving most task success [1].
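To get a feel for what "up to 52.7%" means in practice, here is a small back-of-the-envelope calculation. The daily token volume is a made-up workload, and 52.7% is the paper's reported upper bound, so treat the result as a best case rather than an expected saving.

```python
def tokens_after_reduction(baseline_tokens: int, reduction: float) -> int:
    """Reasoning tokens left after cutting `reduction` (0..1) from a baseline."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return round(baseline_tokens * (1.0 - reduction))


# Hypothetical workload: 1,000,000 reasoning tokens per day at fixed high effort.
baseline = 1_000_000
best_case = tokens_after_reduction(baseline, 0.527)  # upper bound reported in [1]

print(best_case)             # 473000
print(baseline - best_case)  # 527000
```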
That is the strongest argument for this UX: not every step deserves premium thinking.
But there is a catch. Another 2026 paper, "Effort as Ceiling, Not Dial," tested GPT-OSS models across low, medium, and high settings. It found that effort changed trace length only modestly and did not systematically change cognitive-cost alignment with humans. The authors interpret the parameter more like an upper budget than a real-time compute dial [2].
A third study comparing "instant" versus "thinking" modes across providers found that reasoning-token spend varied dramatically by model family. Claude averaged 33 reasoning tokens per call in the study's thinking setup, while Qwen averaged 2,639. The paper's warning is important: "thinking mode" is not a standardized construct across providers [4].
My take: reasoning_effort is a production control, not a guarantee. It gives you a better contract with the model. It does not solve every hard problem.
Developers should set reasoning_effort explicitly on every reasoning-capable call and choose the lowest level that reliably solves the task. Use low or none for formatting, extraction, classification, and simple replies. Use high for planning, code edits, ambiguous decisions, and agent recovery.
Here is the pattern I would use in a real product.
If the task is deterministic, short, and low-risk:
reasoning_effort = "low" or "none"
If the task requires synthesis, comparison, or light debugging:
reasoning_effort = "medium"
If the task involves multi-step planning, code changes, financial/legal reasoning, or agent recovery:
reasoning_effort = "high"
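The three rules above can be sketched as a small lookup. The task-type labels are hypothetical names I picked to mirror the categories in the text; your product will have its own. Unknown task types raise instead of silently defaulting, which keeps the choice explicit.

```python
# Hypothetical task labels mirroring the categories described above.
LOW_TASKS = {"formatting", "extraction", "classification", "simple_reply"}
MEDIUM_TASKS = {"synthesis", "comparison", "light_debugging"}
HIGH_TASKS = {"planning", "code_change", "financial_legal", "agent_recovery"}


def choose_effort(task_type: str) -> str:
    """Map a task type to the effort tier suggested by the rules above."""
    if task_type in LOW_TASKS:
        return "low"
    if task_type in MEDIUM_TASKS:
        return "medium"
    if task_type in HIGH_TASKS:
        return "high"
    # No silent default: force the caller to label new task types.
    raise ValueError(f"no effort level defined for task type: {task_type!r}")


print(choose_effort("extraction"))
print(choose_effort("agent_recovery"))
```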
The important part is not the exact label. Providers will keep changing labels. The important part is to treat reasoning effort as part of your product contract, like model version, output schema, and max output tokens.
The LLM code-smells paper makes this point directly. If a reasoning-capable model is invoked without explicit reasoning configuration, the system silently relies on provider defaults. Those defaults can differ by model or change over time, which harms reproducibility, maintainability, and reliability [3].
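A cheap way to avoid that smell is a guard in your own wrapper that rejects requests without an explicit effort. This is a minimal sketch: the `reasoning_effort` key and the allowed values are assumptions about your internal request shape, not any provider's schema.

```python
# Effort labels this sketch accepts; adjust to whatever your provider supports.
ALLOWED_EFFORTS = {"none", "low", "medium", "high"}


def require_explicit_effort(request: dict) -> dict:
    """Return the request unchanged, or raise if effort is missing or unknown."""
    effort = request.get("reasoning_effort")
    if effort is None:
        raise ValueError(
            "reasoning_effort must be set explicitly; "
            "refusing to rely on provider defaults"
        )
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"unsupported reasoning_effort: {effort!r}")
    return request


ok = require_explicit_effort({"model": "reasoning-model-name", "reasoning_effort": "low"})
print(ok["reasoning_effort"])
```

Calling it with a request that omits `reasoning_effort` raises a `ValueError` before anything reaches the provider.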
That is why I would log effort next to every request:
{
  "model": "reasoning-model-name",
  "task_type": "code_review",
  "reasoning_effort": "high",
  "max_output_tokens": 1200,
  "schema": "CodeReviewFinding[]"
}
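Here is one way that record could be emitted from Python with the standard library. The `log_request` helper and the logger name are hypothetical; the fields are the ones in the record above.

```python
import json
import logging

logger = logging.getLogger("llm_requests")  # hypothetical logger name


def log_request(model: str, task_type: str, reasoning_effort: str,
                max_output_tokens: int, schema: str) -> str:
    """Serialize the request contract as one JSON line and log it."""
    record = {
        "model": model,
        "task_type": task_type,
        "reasoning_effort": reasoning_effort,
        "max_output_tokens": max_output_tokens,
        "schema": schema,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


line = log_request("reasoning-model-name", "code_review", "high",
                   1200, "CodeReviewFinding[]")
```

Nothing is printed until you configure logging, for example with `logging.basicConfig(level=logging.INFO)`.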
If you write prompts manually, this is also where tools like Rephrase help. You can draft the task in plain English, then use Rephrase to turn it into a tighter prompt with clearer scope, constraints, and output format before choosing the reasoning level.
Prompt engineering is shifting from "write better instructions" to "write better request contracts." The prompt still matters, but model, reasoning effort, output format, context budget, and task routing now matter just as much for reliable AI behavior.
Here is a before-and-after example.
Before:
Review this pull request and tell me if anything looks wrong.
After:
You are reviewing a pull request for production readiness.
Task:
Identify correctness, security, and maintainability issues.
Constraints:
Focus only on issues that could affect runtime behavior or future maintenance.
Ignore style-only comments unless they hide a real bug.
Return findings as JSON with: file, line, severity, issue, suggested_fix.
Use high reasoning effort because this review requires multi-step code analysis.
Notice what changed. The improved version does not just ask for "better reasoning." It defines the job, narrows the review surface, specifies the output contract, and explains why high effort is justified.
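An output contract is only useful if you check it. Below is a minimal sketch that validates the JSON shape requested in the improved prompt. It assumes the model returns a JSON array of objects, and `parse_findings` is a hypothetical helper, not part of any SDK.

```python
import json

# Fields requested by the improved prompt above.
REQUIRED_FIELDS = ("file", "line", "severity", "issue", "suggested_fix")


def parse_findings(raw: str) -> list[dict]:
    """Parse model output and verify each finding has the required fields."""
    findings = json.loads(raw)
    if not isinstance(findings, list):
        raise ValueError("expected a JSON array of findings")
    for index, finding in enumerate(findings):
        if not isinstance(finding, dict):
            raise ValueError(f"finding {index} is not a JSON object")
        missing = [name for name in REQUIRED_FIELDS if name not in finding]
        if missing:
            raise ValueError(f"finding {index} is missing fields: {missing}")
    return findings


# Hypothetical model output used only to exercise the validator.
sample = json.dumps([{
    "file": "app.py", "line": 42, "severity": "high",
    "issue": "unchecked None", "suggested_fix": "guard before access",
}])
print(len(parse_findings(sample)))
```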
For a simpler request, do the opposite.
Rewrite this Slack update to be concise and friendly.
Use low reasoning effort. Preserve the facts. Do not add new commitments.
That is the new muscle: matching reasoning depth to task risk. For more examples like this, the Rephrase blog covers practical prompt patterns across coding, writing, and AI tools.
The next step is adaptive reasoning by default. Developers will still pin effort for sensitive workflows, but most agent systems will learn to route effort per step, using high effort only when the current state is complex, risky, or uncertain.
ARES is basically a preview of that future. It shows that a small router can decide effort levels from interaction history, reducing tokens while keeping task performance close to high-effort baselines [1]. Anthropic's adaptive thinking points in the same direction at the product level [6]. Mistral's single-model, per-request reasoning_effort points there too [5].
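To show the shape of per-step routing, here is a hand-written heuristic stand-in. To be clear, this is not the ARES method: ARES trains a router, while this sketch uses fixed rules, and the `StepState` signals and thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepState:
    """Hypothetical signals summarizing an agent's recent interaction history."""
    last_step_failed: bool
    consecutive_errors: int
    is_irreversible_action: bool


def route_effort(state: StepState) -> str:
    """Pick an effort tier for the next agent step from simple fixed rules."""
    # Risky or repeatedly failing steps get the most thinking.
    if state.is_irreversible_action or state.consecutive_errors >= 2:
        return "high"
    # A single recent failure earns a moderate bump.
    if state.last_step_failed:
        return "medium"
    # Straightforward steps stay cheap.
    return "low"


print(route_effort(StepState(False, 0, False)))
print(route_effort(StepState(True, 2, False)))
```

A learned router would replace the `if` chain, but the interface stays the same: state in, effort level out.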
The convergence matters because it gives teams a shared abstraction. Whether you build on Mistral, Anthropic, or OpenAI, you can now design around the same question:
How much should the model think for this request?
That question belongs in your prompt workflow, your API wrapper, your logs, and your evals. If you ignore it, you are letting the provider choose your latency, cost, and reasoning behavior for you.
Try this today: take five common AI calls in your product and label them low, medium, or high effort. Then test whether the expensive ones actually need it. You will probably find waste. You may also find a few places where you were under-thinking the important work.
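If you already have an eval set, the audit can be sketched as a tiny function that picks the lowest effort level meeting your reliability bar. The pass rates below are invented numbers and the 0.95 threshold is an arbitrary assumption; plug in your own eval results.

```python
from typing import Optional

# Cheapest tier first, so the first match is the lowest adequate effort.
EFFORT_ORDER = ("low", "medium", "high")


def lowest_reliable_effort(pass_rates: dict[str, float],
                           threshold: float = 0.95) -> Optional[str]:
    """Return the cheapest effort whose pass rate meets the threshold, if any."""
    for effort in EFFORT_ORDER:
        if pass_rates.get(effort, 0.0) >= threshold:
            return effort
    return None


# Hypothetical eval results for one call type.
rates = {"low": 0.91, "medium": 0.97, "high": 0.98}
print(lowest_reliable_effort(rates))  # medium
```

A `None` result is its own signal: even high effort does not clear the bar, so the fix is probably in the prompt, the context, or the model rather than the effort setting.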
Documentation & Research
Community & Industry Examples
reasoning_effort is a per-request control that tells a reasoning model how much internal thinking budget to use. Higher effort usually improves hard-task reliability but increases latency and cost.
Mistral, Anthropic, and OpenAI expose similar UX but not identical semantics. OpenAI uses reasoning effort tiers, Anthropic exposes thinking or effort controls, and Mistral Small 4 exposes per-request reasoning_effort levels.