Learn how to choose JSON mode or constrained decoding for structured output in 2026, with tradeoffs, examples, and use cases.
Most teams still ask the wrong question about structured output. They ask, "How do I force valid JSON?" The better question is, "What am I willing to trade to get it?"
Structured output in 2026 means asking an LLM to return machine-readable data, usually JSON, under varying levels of enforcement ranging from plain prompting to schema-constrained decoding. The core design choice is no longer whether to use structure at all, but how hard you want the runtime to enforce it and what tradeoffs you accept in reliability, speed, and reasoning quality [1][2].
If you build AI features into products, structured output is basically the interface layer between language models and code. Tool calls, extraction pipelines, agent state, UI objects, and workflow actions all rely on it. The difference now is that providers expose more than one path: plain JSON instructions, JSON mode, and full schema-based constrained decoding.
That sounds neat. The catch is that these options do not fail the same way.
JSON mode is a lighter contract that aims for valid JSON output, while constrained decoding enforces a schema or grammar token by token during generation. JSON mode is usually cheaper and more flexible, but constrained decoding offers stronger guarantees when the schema is simple and the downstream parser cannot tolerate malformed output [2][3].
Here's the simplest way I think about it: JSON mode is a promise. Constrained decoding is a guardrail.
With JSON mode, the model is still "thinking in language" and then trying to serialize the answer correctly. With constrained decoding, the sampler itself blocks illegal next tokens. Research summarized in ExtractBench describes providers compiling schemas into grammar artifacts or automata, which then restrict valid next-token choices during decoding [2]. That's the big difference.
The practical result is this:
| Approach | What it guarantees | Main upside | Main downside | Best use case |
|---|---|---|---|---|
| Plain "return JSON" prompt | Nothing | Flexible, portable | Chatty wrappers, markdown fences, parse failures | Prototyping, loose extraction |
| JSON mode | Usually valid JSON object shape | Low friction, fast | Can still be semantically wrong or underspecified | Light app integration |
| Constrained decoding | Valid syntax and schema-conforming output | Strongest structural control | Latency, schema limits, reasoning degradation on harder tasks | Tool calls, shallow contracts |
That middle lane matters. A lot of teams jump straight from "return JSON only" to full constrained decoding. Often, JSON mode is enough.
Constrained decoding can hurt performance because the model must satisfy content and structure at the same time, and the grammar itself adds computational and cognitive overhead. On complex tasks, this can reduce validity, lower extraction accuracy, or even trigger schema rejection before generation starts [1][2].
This is the part people tend to miss.
The 2026 paper The Format Tax found that much of the performance hit from structured output appears before decoding constraints even apply. Just asking for structured formats like JSON can reduce reasoning and writing quality, especially in open-weight models [1]. That means some of the damage is prompt-level, not sampler-level.
Then ExtractBench adds a second warning: on complex enterprise extraction tasks, structured output modes actually reduced both validity and accuracy compared with prompt-based extraction [2]. That sounds backwards until you remember what the model is trying to do. It has to understand a long document, map it to a large schema, and obey a rigid grammar with no graceful fallback.
In their benchmark, provider structured output modes struggled badly with large schemas. A 369-field SEC schema produced 0% valid output across tested frontier models, and some resume schemas were rejected outright in structured mode [2]. So yes, constrained decoding can absolutely be the wrong choice.
My rule of thumb is simple: the more a task depends on deep reasoning or massive schema breadth, the less I trust hard constraints to save it.
Use JSON mode when you want structured output quickly, your schema is modest, and occasional retries or post-validation are acceptable. It is the best default for many app features because it keeps prompts and runtime simple without paying the full cost of grammar enforcement [1][3].
JSON mode shines in product features where you need a clean object, not a legally binding contract. Think UI metadata, email classifications, lightweight extraction, content tagging, or assistant responses that feed app logic.
Here's a before-and-after prompt pattern I like:
Before
Summarize this customer message and return JSON only.
After
Read the customer message and return a valid JSON object with exactly these keys:
- sentiment: one of ["positive", "neutral", "negative"]
- priority: one of ["low", "medium", "high"]
- summary: short string under 30 words
Do not include markdown, commentary, or extra keys.
Customer message: {{message}}
That second version still relies on prompting, but it narrows the target enough that JSON mode has a real chance to work.
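Whichever mode runs it, the contract above is cheap to enforce in code. Here is a minimal post-validation sketch, stdlib only, with the field names and limits taken straight from the prompt:

```python
import json

ALLOWED = {
    "sentiment": {"positive", "neutral", "negative"},
    "priority": {"low", "medium", "high"},
}

def validate_reply(raw: str) -> dict:
    """Parse a JSON-mode reply and enforce the contract from the prompt.
    Raises ValueError (or json.JSONDecodeError) on any violation, so the
    caller can retry or repair instead of crashing downstream."""
    obj = json.loads(raw)
    if set(obj) != {"sentiment", "priority", "summary"}:
        raise ValueError(f"unexpected keys: {sorted(obj)}")
    for field, allowed in ALLOWED.items():
        if obj[field] not in allowed:
            raise ValueError(f"bad {field}: {obj[field]!r}")
    if not isinstance(obj["summary"], str) or len(obj["summary"].split()) >= 30:
        raise ValueError("summary must be a string under 30 words")
    return obj
```

Anything that raises here becomes input to a retry or repair pass rather than a production incident.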
For lots of everyday prompting, tools like Rephrase help tighten these instructions automatically, especially when you're jumping between apps and don't want to hand-craft every schema request.
One more thing: JSON mode is often good enough if you combine it with validation, retries, and a repair pass. That is still cheaper than forcing every call through the heaviest possible structure system.
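That combination is only a few lines. A sketch of the pattern, where call_model is a placeholder for whatever provider client you actually use and validate is any callable that raises on bad output:

```python
import json

def generate_validated(call_model, prompt, validate, max_tries=3):
    """Call the model, validate the reply, and on failure feed the error back
    as a repair instruction. `call_model` and `validate` are placeholders for
    your provider client and schema check; this is a pattern sketch, not a
    specific SDK's API."""
    last_error = None
    for _ in range(max_tries):
        raw = call_model(
            prompt if last_error is None else
            f"{prompt}\n\nYour previous reply failed validation "
            f"({last_error}). Return corrected JSON only."
        )
        try:
            return validate(raw)
        except (ValueError, json.JSONDecodeError) as exc:
            last_error = exc
    raise RuntimeError(f"no valid output after {max_tries} tries: {last_error}")
```

One retry with the error message attached fixes a surprising share of failures, because the model usually got the content right and only botched the wrapper.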
Use constrained decoding when malformed output is more expensive than slower generation, and when your schema is small, shallow, and well-supported by the provider. It is strongest in machine-critical workflows where you need strict structure more than open-ended reasoning [2][3].
This is where constrained decoding earns its keep: tool arguments, workflow transitions, API payloads, and deterministic backend actions.
If a single trailing comma can crash your job runner, hard constraints are worth a lot. ExtractBench notes that constrained decoding should, in principle, eliminate formatting failures like trailing commas and truncated JSON [2]. On simpler structures, that benefit is real.
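If you stay with prompted JSON instead, you can neutralize that exact failure class in a cheap repair pass. A regex sketch, with the obvious caveat stated in the comment:

```python
import json
import re

def strip_trailing_commas(raw: str) -> str:
    """Remove commas that appear directly before a closing brace or bracket,
    the most common syntax slip in prompted-JSON output.
    Caveat: a comma-before-brace inside a string value would also be
    rewritten, so this is a best-effort repair pass, not a parser."""
    return re.sub(r",\s*([}\]])", r"\1", raw)
```

Run it only after json.loads has already failed, so well-formed output never passes through the rewrite at all.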
A useful decision pattern looks like this:
| If your priority is... | Use this |
|---|---|
| Fast setup and broad portability | JSON mode |
| Guaranteed parser-safe syntax | Constrained decoding |
| Maximum reasoning quality | Freeform first, then reformat |
| Huge or deeply nested schema extraction | Prompted JSON + validation pipeline |
What I've noticed is that constrained decoding works best when the structure is the task. It works worst when structure competes with the task.
That distinction matters a lot in agents. One cited study on workflow generation found strict JSON constraints could completely break smaller models, while unconstrained reasoning plus downstream parsing performed much better [4]. That won't be true in every setup, but it's a strong reminder not to confuse "more constraints" with "more reliability."
The best production pattern in 2026 is often two-stage generation: let the model reason freely first, then format or validate in a second step. This reduces the format tax, preserves reasoning quality, and gives you more control over repair and fallback behavior [1][4].
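A minimal two-stage sketch, with call_model again standing in for your provider client and the output keys ("answer", "confidence") purely illustrative:

```python
def two_stage(call_model, question):
    """Stage 1: let the model reason in free prose, paying no format tax.
    Stage 2: a cheap, low-stakes pass serializes that answer as JSON.
    `call_model` is a placeholder for your provider client; the keys
    "answer" and "confidence" are illustrative, not prescribed."""
    reasoning = call_model(
        f"Answer the following. Think step by step in plain prose.\n\n{question}"
    )
    return call_model(
        "Convert the answer below into a JSON object with keys "
        '"answer" (string) and "confidence" ("low"|"medium"|"high"). '
        f"Return JSON only.\n\n{reasoning}"
    )
```

The second call can even go to a smaller, cheaper model, since reformatting prose it was handed is far easier than reasoning under constraints.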
If I were designing a new structured-output workflow today, I'd start here: generate the answer in free text, reformat it into JSON in a second pass, then validate and repair before anything downstream touches it.
That pattern lines up with The Format Tax, which found that decoupling reasoning from formatting recovers much of the lost performance [1]. It also matches what practitioners report in the wild: models often "know" the answer, but mess up the wrapper. One Reddit benchmark found only 33.3% of plain prompt responses passed strict json.loads, while 99.5% contained extractable JSON somewhere in the response [5].
That's messy, but it's also useful. It tells you the semantic answer and the serialization layer are often separate problems.
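That is also why a dumb brace-matching extractor recovers so much: it treats the chatty wrapper as noise and hunts for the first balanced object. A minimal sketch:

```python
import json

def extract_json(text: str):
    """Find and parse the first balanced {...} object in a messy reply,
    skipping markdown fences, preambles, and trailing commentary.
    Returns None if no parseable object exists anywhere in the text."""
    start = text.find("{")
    while start != -1:
        depth, in_string, escape = 0, False, False
        for i, ch in enumerate(text[start:], start):
            if escape:
                escape = False           # skip the character after a backslash
            elif in_string:
                if ch == "\\":
                    escape = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break            # balanced but invalid; try the next "{"
        start = text.find("{", start + 1)
    return None
```

Pair this with the validation step above and plain prompting gets you most of the way to JSON mode's reliability without touching the sampler.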
If you want more workflows like this, the Rephrase blog has a growing library of prompt engineering patterns. And if you do this kind of prompt cleanup all day, Rephrase is one of those rare tools that actually saves time instead of creating another workspace to manage.
The short version is this: use JSON mode by default, use constrained decoding when failure is expensive, and split reasoning from formatting when the task gets hard. Structured output is no longer just about valid JSON. It's about choosing the right failure mode.
Documentation & Research
Community Examples

5. I benchmarked 672 "Return JSON only" calls. Strict parsing failed 67% of the time. Here's why. - r/LocalLLaMA (link)
JSON mode usually tells the model to return valid JSON, while constrained decoding actively restricts token generation to a grammar or schema. In practice, constrained decoding is stricter but can add latency and fail on complex schemas.
Use plain prompts when you need flexibility, mixed reasoning plus structure, or when your schema is too large for provider limits. You should still add validation and repair logic downstream.