Learn how to choose JSON mode or constrained decoding for structured output in 2026, with tradeoffs, examples, and use cases.
Most teams still ask the wrong question about structured output. They ask, "How do I force valid JSON?" The better question is, "What am I willing to trade to get it?"
Structured output in 2026 means asking an LLM to return machine-readable data, usually JSON, under varying levels of enforcement ranging from plain prompting to schema-constrained decoding. The core design choice is no longer whether to use structure at all, but how hard you want the runtime to enforce it and what tradeoffs you accept in reliability, speed, and reasoning quality [1][2].
If you build AI features into products, structured output is basically the interface layer between language models and code. Tool calls, extraction pipelines, agent state, UI objects, and workflow actions all rely on it. The difference now is that providers expose more than one path: plain JSON instructions, JSON mode, and full schema-based constrained decoding.
That sounds neat. The catch is that these options do not fail the same way.
JSON mode is a lighter contract that aims for valid JSON output, while constrained decoding enforces a schema or grammar token by token during generation. JSON mode is usually cheaper and more flexible, but constrained decoding offers stronger guarantees when the schema is simple and the downstream parser cannot tolerate malformed output [2][3].
Here's the simplest way I think about it: JSON mode is a promise. Constrained decoding is a guardrail.
With JSON mode, the model is still "thinking in language" and then trying to serialize the answer correctly. With constrained decoding, the sampler itself blocks illegal next tokens. Research summarized in ExtractBench describes providers compiling schemas into grammar artifacts or automata, which then restrict valid next-token choices during decoding [2]. That's the big difference.
The practical result is this:
| Approach | What it guarantees | Main upside | Main downside | Best use case |
|---|---|---|---|---|
| Plain "return JSON" prompt | Nothing | Flexible, portable | Chatty wrappers, markdown fences, parse failures | Prototyping, loose extraction |
| JSON mode | Usually valid JSON object shape | Low friction, fast | Can still be semantically wrong or underspecified | Light app integration |
| Constrained decoding | Valid syntax and schema-conforming output | Strongest structural control | Latency, schema limits, reasoning degradation on harder tasks | Tool calls, shallow contracts |
That middle lane matters. A lot of teams jump straight from "return JSON only" to full constrained decoding. Often, JSON mode is enough.
Constrained decoding can hurt performance because the model must satisfy content and structure at the same time, and the grammar itself adds computational and cognitive overhead. On complex tasks, this can reduce validity, lower extraction accuracy, or even trigger schema rejection before generation starts [1][2].
This is the part people tend to miss.
The 2026 paper The Format Tax found that much of the performance hit from structured output appears before decoding constraints even apply. Just asking for structured formats like JSON can reduce reasoning and writing quality, especially in open-weight models [1]. That means some of the damage is prompt-level, not sampler-level.
Then ExtractBench adds a second warning: on complex enterprise extraction tasks, structured output modes actually reduced both validity and accuracy compared with prompt-based extraction [2]. That sounds backwards until you remember what the model is trying to do. It has to understand a long document, map it to a large schema, and obey a rigid grammar with no graceful fallback.
In their benchmark, provider structured output modes struggled badly with large schemas. A 369-field SEC schema produced 0% valid output across tested frontier models, and some resume schemas were rejected outright in structured mode [2]. So yes, constrained decoding can absolutely be the wrong choice.
My rule of thumb is simple: the more a task depends on deep reasoning or massive schema breadth, the less I trust hard constraints to save it.
Use JSON mode when you want structured output quickly, your schema is modest, and occasional retries or post-validation are acceptable. It is the best default for many app features because it keeps prompts and runtime simple without paying the full cost of grammar enforcement [1][3].
JSON mode shines in product features where you need a clean object, not a legally binding contract. Think UI metadata, email classifications, lightweight extraction, content tagging, or assistant responses that feed app logic.
Here's a before-and-after prompt pattern I like:
Before
Summarize this customer message and return JSON only.
After
Read the customer message and return a valid JSON object with exactly these keys:
- sentiment: one of ["positive", "neutral", "negative"]
- priority: one of ["low", "medium", "high"]
- summary: short string under 30 words
Do not include markdown, commentary, or extra keys.
Customer message: {{message}}
That second version still relies on prompting, but it narrows the target enough that JSON mode has a real chance to work.
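Whichever mode runs it, the contract above is cheap to enforce in code. Here is a minimal post-validation sketch, stdlib only, with the field names and limits taken straight from the prompt:

```python
import json

ALLOWED = {
    "sentiment": {"positive", "neutral", "negative"},
    "priority": {"low", "medium", "high"},
}

def validate_reply(raw: str) -> dict:
    """Parse a JSON-mode reply and enforce the contract from the prompt.
    Raises ValueError (or json.JSONDecodeError) on any violation, so the
    caller can retry or repair instead of crashing downstream."""
    obj = json.loads(raw)
    if set(obj) != {"sentiment", "priority", "summary"}:
        raise ValueError(f"unexpected keys: {sorted(obj)}")
    for field, allowed in ALLOWED.items():
        if obj[field] not in allowed:
            raise ValueError(f"bad {field}: {obj[field]!r}")
    if not isinstance(obj["summary"], str) or len(obj["summary"].split()) >= 30:
        raise ValueError("summary must be a string under 30 words")
    return obj
```

Anything that raises here becomes input to a retry or repair pass rather than a production incident.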
For lots of everyday prompting, tools like Rephrase help tighten these instructions automatically, especially when you're jumping between apps and don't want to hand-craft every schema request.
One more thing: JSON mode is often good enough if you combine it with validation, retries, and a repair pass. That is still cheaper than forcing every call through the heaviest possible structure system.
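That combination is only a few lines. A sketch of the pattern, where call_model is a placeholder for whatever provider client you actually use and validate is any callable that raises on bad output:

```python
import json

def generate_validated(call_model, prompt, validate, max_tries=3):
    """Call the model, validate the reply, and on failure feed the error back
    as a repair instruction. `call_model` and `validate` are placeholders for
    your provider client and schema check; this is a pattern sketch, not a
    specific SDK's API."""
    last_error = None
    for _ in range(max_tries):
        raw = call_model(
            prompt if last_error is None else
            f"{prompt}\n\nYour previous reply failed validation "
            f"({last_error}). Return corrected JSON only."
        )
        try:
            return validate(raw)
        except (ValueError, json.JSONDecodeError) as exc:
            last_error = exc
    raise RuntimeError(f"no valid output after {max_tries} tries: {last_error}")
```

One retry with the error message attached fixes a surprising share of failures, because the model usually got the content right and only botched the wrapper.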
Use constrained decoding when malformed output is more expensive than slower generation, and when your schema is small, shallow, and well-supported by the provider. It is strongest in machine-critical workflows where you need strict structure more than open-ended reasoning [2][3].
This is where constrained decoding earns its keep: tool arguments, workflow transitions, API payloads, and deterministic backend actions.
If a single trailing comma can crash your job runner, hard constraints are worth a lot. ExtractBench notes that constrained decoding should, in principle, eliminate formatting failures like trailing commas and truncated JSON [2]. On simpler structures, that benefit is real.
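If you stay with prompted JSON instead, you can neutralize that exact failure class in a cheap repair pass. A regex sketch, with the obvious caveat stated in the comment:

```python
import json
import re

def strip_trailing_commas(raw: str) -> str:
    """Remove commas that appear directly before a closing brace or bracket,
    the most common syntax slip in prompted-JSON output.
    Caveat: a comma-before-brace inside a string value would also be
    rewritten, so this is a best-effort repair pass, not a parser."""
    return re.sub(r",\s*([}\]])", r"\1", raw)
```

Run it only after json.loads has already failed, so well-formed output never passes through the rewrite at all.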
A useful decision pattern looks like this:
| If your priority is... | Use this |
|---|---|
| Fast setup and broad portability | JSON mode |
| Guaranteed parser-safe syntax | Constrained decoding |
| Maximum reasoning quality | Freeform first, then reformat |
| Huge or deeply nested schema extraction | Prompted JSON + validation pipeline |
What I've noticed is that constrained decoding works best when the structure is the task. It works worst when structure competes with the task.
That distinction matters a lot in agents. One cited study on workflow generation found strict JSON constraints could completely break smaller models, while unconstrained reasoning plus downstream parsing performed much better [4]. That won't be true in every setup, but it's a strong reminder not to confuse "more constraints" with "more reliability."
The best production pattern in 2026 is often two-stage generation: let the model reason freely first, then format or validate in a second step. This reduces the format tax, preserves reasoning quality, and gives you more control over repair and fallback behavior [1][4].
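A minimal two-stage sketch, with call_model again standing in for your provider client and the output keys ("answer", "confidence") purely illustrative:

```python
def two_stage(call_model, question):
    """Stage 1: let the model reason in free prose, paying no format tax.
    Stage 2: a cheap, low-stakes pass serializes that answer as JSON.
    `call_model` is a placeholder for your provider client; the keys
    "answer" and "confidence" are illustrative, not prescribed."""
    reasoning = call_model(
        f"Answer the following. Think step by step in plain prose.\n\n{question}"
    )
    return call_model(
        "Convert the answer below into a JSON object with keys "
        '"answer" (string) and "confidence" ("low"|"medium"|"high"). '
        f"Return JSON only.\n\n{reasoning}"
    )
```

The second call can even go to a smaller, cheaper model, since reformatting prose it was handed is far easier than reasoning under constraints.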
If I were designing a new structured-output workflow today, I'd start here: generate the answer in free text, reformat it into JSON in a second pass, then validate and repair before anything downstream touches it.
That pattern lines up with The Format Tax, which found that decoupling reasoning from formatting recovers much of the lost performance [1]. It also matches what practitioners report in the wild: models often "know" the answer, but mess up the wrapper. One Reddit benchmark found only 33.3% of plain prompt responses passed strict json.loads, while 99.5% contained extractable JSON somewhere in the response [5].
That's messy, but it's also useful. It tells you the semantic answer and the serialization layer are often separate problems.
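That is also why a dumb brace-matching extractor recovers so much: it treats the chatty wrapper as noise and hunts for the first balanced object. A minimal sketch:

```python
import json

def extract_json(text: str):
    """Find and parse the first balanced {...} object in a messy reply,
    skipping markdown fences, preambles, and trailing commentary.
    Returns None if no parseable object exists anywhere in the text."""
    start = text.find("{")
    while start != -1:
        depth, in_string, escape = 0, False, False
        for i, ch in enumerate(text[start:], start):
            if escape:
                escape = False           # skip the character after a backslash
            elif in_string:
                if ch == "\\":
                    escape = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break            # balanced but invalid; try the next "{"
        start = text.find("{", start + 1)
    return None
```

Pair this with the validation step above and plain prompting gets you most of the way to JSON mode's reliability without touching the sampler.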
If you want more workflows like this, the Rephrase blog has a growing library of prompt engineering patterns. And if you do this kind of prompt cleanup all day, Rephrase is one of those rare tools that actually saves time instead of creating another workspace to manage.
The short version is this: use JSON mode by default, use constrained decoding when failure is expensive, and split reasoning from formatting when the task gets hard. Structured output is no longer just about valid JSON. It's about choosing the right failure mode.
Documentation & Research
Community Examples

5. I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. - r/LocalLLaMA (link)
JSON mode usually tells the model to return valid JSON, while constrained decoding actively restricts token generation to a grammar or schema. In practice, constrained decoding is stricter but can add latency and fail on complex schemas.
Use plain prompts when you need flexibility, mixed reasoning plus structure, or when your schema is too large for provider limits. You should still add validation and repair logic downstream.