reasoning_effort Is the New AI API UX

Learn how reasoning_effort became the shared AI API UX across Mistral, Anthropic, and OpenAI, with examples for cost-aware prompts.

Ilia Ilinskii
Rephrase · May 29, 2026

Prompt engineering · 6 min read

The most interesting AI API trend right now is not a bigger context window or another benchmark win. It is that Mistral, Anthropic, and OpenAI all landed on the same product idea: make reasoning a per-request setting.

Key Takeaways

Per-request reasoning_effort is becoming the default UX for reasoning-capable AI APIs.
Mistral, Anthropic, and OpenAI converged on the same control surface: choose less or more thinking at call time.
Research suggests effort controls are useful, but they are not magic dials; provider semantics differ.
Production teams should explicitly set reasoning effort instead of relying on defaults.
The best pattern is adaptive: low effort for obvious tasks, higher effort for risky or complex steps.

What is per-request reasoning_effort?

Per-request reasoning_effort is an inference-time setting that controls how much reasoning a model should spend on a specific call. Instead of choosing a "fast model" for simple tasks and a "reasoning model" for hard tasks, developers can often keep one model endpoint and vary effort per request.

That sounds small. It is not. It changes the architecture of AI products.

The old pattern was model routing. You sent cheap tasks to a small model, hard tasks to a bigger model, and hoped your router knew the difference. The new pattern is behavior routing inside the same model family. The request itself says, in effect: answer fast, think a little, or think hard.

Researchers working on ARES describe this as a more granular alternative to multi-model routing. Their paper notes that many modern LLMs now support configurable levels such as low, medium, and high, and that agents should reserve high reasoning for difficult steps while using lower effort for straightforward actions [1].

Here is the rough product translation: reasoning is becoming a runtime budget, not a model identity.
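Here is a minimal sketch of that shift. The model names (`small-fast`, `large-reasoner`, `unified-model`) and the request shape are hypothetical, chosen only to contrast the two patterns; they are not any provider's real schema.

```python
from typing import Literal

Effort = Literal["none", "low", "medium", "high"]


def route_by_model(task_is_hard: bool, prompt: str) -> dict:
    """Old pattern: the router picks a different model per task."""
    model = "large-reasoner" if task_is_hard else "small-fast"  # hypothetical names
    return {"model": model, "input": prompt}


def route_by_effort(effort: Effort, prompt: str) -> dict:
    """New pattern: one model, reasoning depth chosen per request."""
    return {
        "model": "unified-model",  # hypothetical name
        "reasoning_effort": effort,
        "input": prompt,
    }


old = route_by_model(task_is_hard=True, prompt="Plan the migration.")
new = route_by_effort("high", "Plan the migration.")
print(old["model"], "->", new["model"], new["reasoning_effort"])
```

In the second function the model stays fixed and only the effort field varies, which is the "runtime budget" idea in code form.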

Why did Mistral, Anthropic, and OpenAI converge on the same UX?

They converged because reasoning models created the same developer pain for everyone: better answers often cost more time and tokens. A per-request effort knob gives product teams a simple, legible way to trade quality against latency and cost without redesigning their model stack.

Mistral's move is the cleanest example. Reports on Mistral Small 4 describe a single model that combines instruction following, reasoning, multimodal input, and agentic coding. More importantly, it exposes a per-request reasoning_effort parameter, with none positioned for fast chat-like responses and high for deliberate reasoning [5].

Anthropic's pattern is similar, even if the naming differs. Claude Opus 4.6 coverage describes explicit effort controls with low, medium, high, and max levels, plus adaptive thinking so the model can decide when extended reasoning is needed [6].

OpenAI has been pushing the same mental model through its reasoning APIs: a reasoning.effort configuration that developers can pin per call. A recent LLM code-smell taxonomy even treats "reasoning effort not explicitly set" as a reliability and maintainability problem for reasoning-capable systems [3].

The UX convergence is obvious: expose reasoning depth as a first-class request parameter.

| Provider | Control surface | Developer mental model | Best use |
| --- | --- | --- | --- |
| Mistral | `reasoning_effort` | One model, vary depth per call | Fast chat through deep reasoning |
| Anthropic | thinking / effort controls | Trade speed, cost, and depth | Coding agents, research, long workflows |
| OpenAI | `reasoning.effort` | Explicit reasoning budget tier | Production calls with reproducible settings |

The names differ. The product idea is the same.
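If you call more than one provider, a thin adapter can hide the naming differences. The sketch below builds plain dictionaries only and makes no SDK calls. The Mistral and OpenAI keys follow the control-surface names in the table; the Anthropic key is a placeholder I made up, because the coverage cited here only describes "effort controls" without a field name. Check each provider's current documentation before relying on any of these shapes.

```python
from typing import Literal

Provider = Literal["mistral", "anthropic", "openai"]


def effort_fields(provider: Provider, effort: str) -> dict:
    """Translate one effort label into provider-shaped request fields.

    This does not validate which labels a given provider accepts.
    """
    if provider == "mistral":
        return {"reasoning_effort": effort}
    if provider == "openai":
        # `reasoning.effort` expressed as a nested object
        return {"reasoning": {"effort": effort}}
    if provider == "anthropic":
        # ASSUMPTION: placeholder key, not confirmed by the sources cited here
        return {"effort": effort}
    raise ValueError(f"unknown provider: {provider}")


for name in ("mistral", "anthropic", "openai"):
    print(name, effort_fields(name, "high"))
```

Keeping this mapping in one function means the rest of the codebase can talk about effort levels without caring which provider is behind the call.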

What does research say about reasoning effort?

Research supports the idea that effort control matters, especially for agents, but it also warns against treating effort as a perfect intelligence dial. Reasoning levels can reduce token cost when routed well, yet a higher setting does not always produce better output, or even substantially more reasoning.

The ARES paper is the most practical source here. It trains a lightweight router to choose low, medium, or high reasoning effort per agent step. Across tool-use, deep-research, and web-navigation tasks, ARES reduced reasoning token usage by up to 52.7% compared with fixed high effort while preserving most task success [1].

That is the strongest argument for this UX: not every step deserves premium thinking.
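To make the arithmetic concrete, here is a toy calculation. The per-step token counts and the mix of steps are invented for illustration and are not taken from the ARES paper; the point is only how a mixed-effort run compares with running every step at high.

```python
# All numbers below are made-up illustration values, not ARES results.
TOKENS_PER_STEP = {"low": 200, "medium": 800, "high": 2000}


def reasoning_tokens(step_efforts: list[str]) -> int:
    """Total reasoning tokens for a sequence of agent steps."""
    return sum(TOKENS_PER_STEP[effort] for effort in step_efforts)


def savings_vs_fixed_high(step_efforts: list[str]) -> float:
    """Fraction of reasoning tokens saved relative to all-high effort."""
    if not step_efforts:
        return 0.0
    baseline = TOKENS_PER_STEP["high"] * len(step_efforts)
    return 1 - reasoning_tokens(step_efforts) / baseline


# Ten agent steps: five easy, three moderate, two hard.
steps = ["low"] * 5 + ["medium"] * 3 + ["high"] * 2
print(f"saved: {savings_vs_fixed_high(steps):.1%}")
```

With your own logged token counts in place of the invented ones, the same two functions give a quick estimate of what routing could save.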

But there is a catch. Another 2026 paper, "Effort as Ceiling, Not Dial," tested GPT-OSS models across low, medium, and high settings. It found that effort changed trace length only modestly and did not systematically change cognitive-cost alignment with humans. The authors interpret the parameter more like an upper budget than a real-time compute dial [2].

A third study comparing "instant" versus "thinking" modes across providers found that reasoning-token spend varied dramatically by model family. Claude averaged 33 reasoning tokens per call in the study's thinking setup, while Qwen averaged 2,639. The paper's warning is important: "thinking mode" is not a standardized construct across providers [4].
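Because spend varies this much between model families, it is worth measuring it on your own traffic. A small sketch, assuming your request logs already carry `model`, `reasoning_effort`, and `reasoning_tokens` fields (hypothetical names, and the rows below are fabricated sample data):

```python
from collections import defaultdict
from statistics import mean


def mean_reasoning_tokens(records: list[dict]) -> dict:
    """Average reasoning tokens for each (model, effort) pair."""
    buckets = defaultdict(list)
    for row in records:
        key = (row["model"], row["reasoning_effort"])
        buckets[key].append(row["reasoning_tokens"])
    return {key: mean(values) for key, values in buckets.items()}


# Fabricated sample rows; replace with real log data.
logs = [
    {"model": "model-a", "reasoning_effort": "low", "reasoning_tokens": 40},
    {"model": "model-a", "reasoning_effort": "high", "reasoning_tokens": 60},
    {"model": "model-b", "reasoning_effort": "high", "reasoning_tokens": 2400},
    {"model": "model-b", "reasoning_effort": "high", "reasoning_tokens": 2600},
]

for key, average in sorted(mean_reasoning_tokens(logs).items()):
    print(key, average)
```

If two effort levels on the same model produce nearly the same average, the setting is behaving more like a ceiling than a dial for your workload.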

My take: reasoning_effort is a production control, not a guarantee. It gives you a better contract with the model. It does not guarantee that every hard problem gets solved.

How should developers set reasoning_effort in production?

Developers should set reasoning_effort explicitly on every reasoning-capable call and choose the lowest level that reliably solves the task. Use low or none for formatting, extraction, classification, and simple replies. Use high for planning, code edits, ambiguous decisions, and agent recovery.

Here is the pattern I would use in a real product.

If the task is deterministic, short, and low-risk:
  reasoning_effort = "low" or "none"

If the task requires synthesis, comparison, or light debugging:
  reasoning_effort = "medium"

If the task involves multi-step planning, code changes, financial/legal reasoning, or agent recovery:
  reasoning_effort = "high"
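The rules above can be sketched as a small function. The `Task` fields are hypothetical flags I introduced to stand in for "multi-step", "high-stakes", and "needs synthesis"; a real product would derive them from its own task metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    multi_step: bool = False       # planning, code changes, agent recovery
    high_stakes: bool = False      # financial or legal reasoning
    needs_synthesis: bool = False  # synthesis, comparison, light debugging


def pick_effort(task: Task) -> str:
    """Return the lowest effort label that the rules above allow."""
    if task.multi_step or task.high_stakes:
        return "high"
    if task.needs_synthesis:
        return "medium"
    # Deterministic, short, low-risk work; use "none" where the provider offers it.
    return "low"


print(pick_effort(Task()))
print(pick_effort(Task(needs_synthesis=True)))
print(pick_effort(Task(multi_step=True)))
```

The checks run from most to least demanding, so a task that is both high-stakes and needs synthesis still resolves to high.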

The important part is not the exact label. Providers will keep changing labels. The important part is to treat reasoning effort as part of your product contract, like model version, output schema, and max output tokens.

The LLM code-smells paper makes this point directly. If a reasoning-capable model is invoked without explicit reasoning configuration, the system silently relies on provider defaults. Those defaults can differ by model or change over time, which harms reproducibility, maintainability, and reliability [3].

That is why I would log effort next to every request:

{
  "model": "reasoning-model-name",
  "task_type": "code_review",
  "reasoning_effort": "high",
  "max_output_tokens": 1200,
  "schema": "CodeReviewFinding[]"
}
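One way to make sure that record always exists is to build it in a helper where effort is a required argument with no default. This is a sketch: `build_logged_request` is a hypothetical name, and the allowed labels are an assumed set that you would align with your provider.

```python
import json

ALLOWED_EFFORTS = {"none", "low", "medium", "high"}  # assumed label set


def build_logged_request(*, model: str, task_type: str, reasoning_effort: str,
                         max_output_tokens: int, schema: str) -> dict:
    """Build the request record; omitting reasoning_effort raises TypeError."""
    if reasoning_effort not in ALLOWED_EFFORTS:
        raise ValueError(f"unsupported reasoning_effort: {reasoning_effort!r}")
    return {
        "model": model,
        "task_type": task_type,
        "reasoning_effort": reasoning_effort,
        "max_output_tokens": max_output_tokens,
        "schema": schema,
    }


record = build_logged_request(
    model="reasoning-model-name",
    task_type="code_review",
    reasoning_effort="high",
    max_output_tokens=1200,
    schema="CodeReviewFinding[]",
)
print(json.dumps(record, indent=2))
```

Because the parameter is keyword-only and has no default, a call site cannot fall back to the provider default by accident.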

If you write prompts manually, this is also where tools like Rephrase help. You can draft the task in plain English, then use Rephrase to turn it into a tighter prompt with clearer scope, constraints, and output format before choosing the reasoning level.

What does this mean for prompt engineering?

Prompt engineering is shifting from "write better instructions" to "write better request contracts." The prompt still matters, but model, reasoning effort, output format, context budget, and task routing now matter just as much for reliable AI behavior.

Here is a before-and-after example.

Before:

Review this pull request and tell me if anything looks wrong.

After:

You are reviewing a pull request for production readiness.

Task:
Identify correctness, security, and maintainability issues.

Constraints:
Focus only on issues that could affect runtime behavior or future maintenance.
Ignore style-only comments unless they hide a real bug.
Return findings as JSON with: file, line, severity, issue, suggested_fix.

Use high reasoning effort because this review requires multi-step code analysis.

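Notice what changed. The improved version does not just ask for "better reasoning." It defines the job, narrows the review surface, specifies the output contract, and explains why high effort is justified. In an API integration, the effort line in the prompt documents intent; the setting itself still has to be passed as the request parameter.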
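A request contract like this can also live in code, with the prompt and the effort setting stored side by side so they cannot drift apart. The sketch below assumes a hypothetical `RequestContract` shape and a placeholder model name; it does not call any provider.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RequestContract:
    """Everything that defines one call, kept in one place (hypothetical shape)."""
    model: str
    reasoning_effort: str
    output_fields: tuple
    prompt: str


REVIEW_PROMPT = """You are reviewing a pull request for production readiness.

Task:
Identify correctness, security, and maintainability issues.

Constraints:
Focus only on issues that could affect runtime behavior or future maintenance.
Ignore style-only comments unless they hide a real bug.
Return findings as JSON with: file, line, severity, issue, suggested_fix."""

pr_review = RequestContract(
    model="reasoning-model-name",  # placeholder, not a real model id
    reasoning_effort="high",       # justified by multi-step code analysis
    output_fields=("file", "line", "severity", "issue", "suggested_fix"),
    prompt=REVIEW_PROMPT,
)
print(asdict(pr_review)["reasoning_effort"])
```

Versioning this object alongside your evals gives you one artifact to diff when behavior changes.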

For a simpler request, do the opposite.

Rewrite this Slack update to be concise and friendly.
Use low reasoning effort. Preserve the facts. Do not add new commitments.

That is the new muscle: matching reasoning depth to task risk. For more examples like this, the Rephrase blog covers practical prompt patterns across coding, writing, and AI tools.

Where does the UX go next?

The next step is adaptive reasoning by default. Developers will still pin effort for sensitive workflows, but most agent systems will learn to route effort per step, using high effort only when the current state is complex, risky, or uncertain.

ARES is basically a preview of that future. It shows that a small router can decide effort levels from interaction history, reducing tokens while keeping task performance close to high-effort baselines [1]. Anthropic's adaptive thinking points in the same direction at the product level [6]. Mistral's single-model, per-request reasoning_effort points there too [5].
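To show the shape of per-step routing, here is a deliberately simple stand-in. It is a hand-written heuristic, not the trained router from the ARES paper: it escalates one level for each consecutive failed step at the end of the history and drops back to low after a success. The `ok` field on each history entry is a hypothetical flag.

```python
LEVELS = ["low", "medium", "high"]


def effort_for_next_step(history: list[dict]) -> str:
    """Pick effort for the next agent step from recent step outcomes."""
    trailing_failures = 0
    for step in reversed(history):
        if step["ok"]:
            break
        trailing_failures += 1
    # Clamp so repeated failures stay at the highest level.
    return LEVELS[min(trailing_failures, len(LEVELS) - 1)]


print(effort_for_next_step([]))
print(effort_for_next_step([{"ok": True}, {"ok": False}]))
print(effort_for_next_step([{"ok": False}, {"ok": False}, {"ok": False}]))
```

A learned router would replace the failure counter with a model over the interaction history, but the interface stays the same: history in, effort label out.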

The convergence matters because it gives teams a shared abstraction. Whether you build on Mistral, Anthropic, or OpenAI, you can now design around the same question:

How much should the model think for this request?

That question belongs in your prompt workflow, your API wrapper, your logs, and your evals. If you ignore it, you are letting the provider choose your latency, cost, and reasoning behavior for you.

Try this today: take five common AI calls in your product and label them low, medium, or high effort. Then test whether the expensive ones actually need it. You will probably find waste. You may also find a few places where you were under-thinking the important work.
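That exercise can be partly automated. The sketch below walks the levels from cheapest to most expensive and reports the first one whose output passes your check. `fake_run` is a stand-in for a real model call, and `lowest_passing_effort` is a hypothetical helper; in practice you would run several trials per level rather than one.

```python
from typing import Callable, Optional

LEVELS = ["low", "medium", "high"]


def lowest_passing_effort(run: Callable[[str], str],
                          passes: Callable[[str], bool]) -> Optional[str]:
    """Return the cheapest effort level whose output passes, or None."""
    for effort in LEVELS:
        if passes(run(effort)):
            return effort
    return None


def fake_run(effort: str) -> str:
    """Stand-in for a model call; returns canned output per effort level."""
    canned = {"low": "draft", "medium": "correct answer", "high": "correct answer"}
    return canned[effort]


print(lowest_passing_effort(fake_run, lambda output: output == "correct answer"))
```

A result below the level you currently ship is a candidate for savings; a `None` result means even high effort fails the check and the problem lies elsewhere.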

References

Documentation & Research

ARES: Adaptive Reasoning Effort Selection for Efficient LLM Agents - arXiv
Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models - arXiv
LLM Code Smells: A Taxonomy and Detection Approach - arXiv
How Does Thinking Mode Change LLM Moral Judgments? - arXiv
OpenAI API reasoning controls - OpenAI Documentation

Community & Industry Examples

Mistral AI Releases Mistral Small 4 - MarkTechPost
Anthropic Releases Claude Opus 4.6 With Adaptive Reasoning Controls - MarkTechPost

Frequently asked

What is reasoning_effort in AI APIs?

reasoning_effort is a per-request control that tells a reasoning model how much internal thinking budget to use. Higher effort usually improves hard-task reliability but increases latency and cost.

How are OpenAI, Anthropic, and Mistral reasoning controls different?

They expose similar UX but not identical semantics. OpenAI uses reasoning effort tiers, Anthropic exposes thinking or effort controls, and Mistral Small 4 exposes per-request reasoning_effort levels.