Getting an LLM to spit out JSON once is trivial. Getting it to do so reliably across ten thousand messy, inconsistent documents - without hallucinating fields, collapsing nested structures, or silently dropping optional keys - is a genuinely hard prompt engineering problem.
This article is about the hard version.
Key Takeaways
- Schema anchoring - embedding your literal target schema in the prompt - is the single highest-leverage technique for structural consistency
- Graceful degradation requires explicit instructions: tell the model what to do when a field is missing, not just what to return when it succeeds
- Multi-pass verification catches structural errors that single-pass prompts miss, especially on noisy inputs
- Research shows that building intermediate structures before the final output (Structure of Thought) improves extraction accuracy by an average of 5.7% [1]
- Reference and document parsing benchmarks confirm that "structured-output brittleness under noisy layouts" is the primary bottleneck at scale [3]
Why Single-Pass Extraction Breaks Down
A single, well-written extraction prompt works fine in demos. It falls apart in production because real documents are inconsistent. Fields appear in different orders, dates use different formats, nested objects sometimes flatten into prose, and optional sections get omitted entirely.
When a model hasn't been told what to do in these edge cases, it improvises. Sometimes it omits the field. Sometimes it invents a plausible value. Sometimes it returns a slightly different key name. Any of these breaks a downstream parser expecting a rigid schema.
Recent benchmarking on reference extraction tasks across multilingual, footnote-heavy documents found that extraction largely "saturates" for capable models - the model finds the right text - but parsing and end-to-end structured output remain the primary failure points due to brittleness under noisy layouts [3]. The model knows what to extract. It just can't reliably put it in the right box.
Schema Anchoring
Schema anchoring means you stop describing your target schema in natural language and start showing it as a literal JSON template. This is the foundation of everything else in this article.
Here is the difference:
# WEAK - natural language description
Extract the invoice details and return them as JSON with fields for
vendor name, amount, date, and line items.
# STRONG - schema-anchored
Extract invoice details from the document below. Return ONLY valid JSON
matching this exact schema. Do not add keys. Do not remove keys.
{
  "vendor_name": "<string | null>",
  "invoice_date": "<ISO 8601 date string | null>",
  "total_amount": "<number | null>",
  "currency": "<3-letter ISO code | null>",
  "line_items": [
    {
      "description": "<string>",
      "quantity": "<number | null>",
      "unit_price": "<number | null>"
    }
  ]
}
If a field cannot be found or inferred from the text, return null for that field.
Do not fabricate values. Do not omit keys.
DOCUMENT:
{{document}}
The typed placeholders - <string | null>, <ISO 8601 date string | null> - do two things at once. They tell the model what type to use and what to return when the field is absent. That second part is what most prompts skip.
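In code, schema anchoring is easiest to keep honest when the template lives in one place and the prompt is assembled from it. A minimal sketch, assuming the invoice schema above (the `INVOICE_SCHEMA` dict and `build_extraction_prompt` are illustrative names, not from any library):

```python
import json

# Typed placeholders: each value documents both the expected type
# and the null fallback, exactly as in the prompt above.
INVOICE_SCHEMA = {
    "vendor_name": "<string | null>",
    "invoice_date": "<ISO 8601 date string | null>",
    "total_amount": "<number | null>",
    "currency": "<3-letter ISO code | null>",
    "line_items": [
        {
            "description": "<string>",
            "quantity": "<number | null>",
            "unit_price": "<number | null>",
        }
    ],
}

def build_extraction_prompt(document: str) -> str:
    """Embed the literal schema template in the prompt (schema anchoring)."""
    return (
        "Extract invoice details from the document below. Return ONLY valid JSON\n"
        "matching this exact schema. Do not add keys. Do not remove keys.\n\n"
        f"{json.dumps(INVOICE_SCHEMA, indent=2)}\n\n"
        "If a field cannot be found or inferred from the text, return null for that field.\n"
        "Do not fabricate values. Do not omit keys.\n\n"
        f"DOCUMENT:\n{document}"
    )
```

Keeping the schema as a data structure rather than a string also means the same object can drive your downstream validator, so the prompt and the parser can never drift apart.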
Conditional Field Extraction
Some schemas are inherently conditional. An e-commerce document might have a shipping_address if it's an order confirmation but not if it's a return receipt. A medical note might have a diagnosis_code only if a diagnosis was made.
Handling this with a single flat schema produces garbage. The model will either hallucinate values or return an empty string, both of which break downstream logic differently.
The better approach is to instruct conditional field extraction explicitly:
Extract the following fields from the document.
ALWAYS extract:
- document_type: one of ["order", "return", "invoice", "unknown"]
- customer_id: string | null
EXTRACT ONLY IF document_type is "order":
- shipping_address: { street, city, postcode, country } | null
- estimated_delivery: ISO 8601 date | null
EXTRACT ONLY IF document_type is "invoice":
- due_date: ISO 8601 date | null
- payment_terms: string | null
For all conditional fields that do not apply, omit the key entirely.
Return ONLY valid JSON. No explanation.
DOCUMENT:
{{document}}
This pattern delegates the conditional logic to the model rather than trying to handle it in post-processing. For well-defined document types, it works cleanly. For ambiguous inputs, you still need graceful degradation.
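Even when the model handles the conditionals, a cheap post-check catches the cases where it emits a field that should have been omitted. A sketch under the schema above (`check_conditional_fields` is an illustrative helper, not from the article):

```python
# Conditional fields allowed per document_type, mirroring the prompt rules.
CONDITIONAL_FIELDS = {
    "order": {"shipping_address", "estimated_delivery"},
    "invoice": {"due_date", "payment_terms"},
}

ALWAYS_FIELDS = {"document_type", "customer_id"}

def check_conditional_fields(record: dict) -> list[str]:
    """Return keys that should have been omitted for this document_type.

    Catches the model emitting order-only fields on a return receipt, etc.
    An empty list means the conditional contract was respected."""
    allowed = ALWAYS_FIELDS | CONDITIONAL_FIELDS.get(
        record.get("document_type"), set()
    )
    return sorted(set(record) - allowed)
```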
Graceful Degradation
Graceful degradation is a design philosophy: your extraction prompt should produce something useful even when the input is incomplete, malformed, or partially out of domain.
Three rules cover most cases.
First, always specify the null contract. Every optional field gets | null in the schema, and the prompt explicitly says "return null, do not omit." This means your downstream parser always sees the key and can handle null as a first-class case rather than catching a KeyError.
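The null contract is easy to enforce mechanically on the consumer side. A minimal sketch, assuming the invoice schema's key names (`check_null_contract` is an illustrative helper):

```python
REQUIRED_KEYS = {"vendor_name", "invoice_date", "total_amount", "currency", "line_items"}

def check_null_contract(record: dict) -> list[str]:
    """Keys the model omitted entirely -- a contract violation.

    Null values are fine and expected; missing keys are not."""
    return sorted(REQUIRED_KEYS - record.keys())
```

Because the contract guarantees every key is present, downstream code can read fields directly and treat `None` as a normal value instead of wrapping every access in a KeyError handler.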
Second, add a _meta object for extraction confidence. This is especially useful for document types where you're uncertain whether the model found real data or guessed:
"_meta": {
"extraction_confidence": "high | medium | low",
"missing_required_fields": ["field_a", "field_b"],
"ambiguous_fields": ["field_c"]
}
Third, define fallback values for typed fields. Dates that can't be parsed become null, not an empty string. Amounts that are textual ("approximately $50") go into a separate raw_amount_text field alongside a null total_amount. This keeps your schema rigid while preserving the raw data for manual review.
Multi-Pass Verification
Even with a perfect schema-anchored prompt, some extraction runs will return structurally invalid JSON - especially on long documents, tables embedded in prose, or heavily nested objects. Multi-pass verification is a second prompt that audits the first output.
Pass 1 extracts. Pass 2 verifies and repairs.
# PASS 2 - Verification prompt
You are a JSON schema validator. Below is an extracted JSON object and
the target schema it should conform to.
Your task:
1. Check that every required key is present
2. Check that all types match the schema
3. Check that null appears where fields are missing (not empty string, not "N/A")
4. If the JSON is invalid, repair it
5. Return ONLY the corrected JSON. No explanation.
TARGET SCHEMA:
{{schema}}
EXTRACTED JSON:
{{pass_1_output}}
This is not just error-catching. It's also a forcing function for the first pass. When you know a second pass will audit the structure, you can write your first-pass prompt to be more aggressive about extraction and less precious about formatting. The two passes have different jobs.
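Wiring the two passes together is a few lines of glue. A sketch in which `call_llm(prompt) -> str` is a placeholder for whatever completion function you use; here Pass 2 runs only when Pass 1 fails to parse, though in production you might run it unconditionally:

```python
import json

def extract_with_verification(document: str, schema: str, call_llm) -> dict:
    """Pass 1 extracts; Pass 2 audits and repairs the structure."""
    pass_1 = call_llm(
        f"Return ONLY valid JSON matching this exact schema.\n\n{schema}\n\n"
        f"DOCUMENT:\n{document}"
    )
    try:
        return json.loads(pass_1)  # structurally valid: skip the repair pass
    except json.JSONDecodeError:
        repaired = call_llm(
            "You are a JSON schema validator. Repair the JSON below so it "
            "conforms to the target schema. Return ONLY the corrected JSON.\n\n"
            f"TARGET SCHEMA:\n{schema}\n\nEXTRACTED JSON:\n{pass_1}"
        )
        return json.loads(repaired)
```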
Research on Structure of Thought (SoT) prompting from Duke University reinforces this intuition: explicitly building intermediate structures - rather than jumping directly to a final output - yields measurable accuracy gains across extraction tasks [1]. Multi-pass verification is, in effect, a form of SoT applied to structured output.
Templates for Common Extraction Tasks
Entity Recognition
Extract all named entities from the text below.
Return a JSON array. Each element must match this schema exactly:
{
  "entity_text": "<string>",
  "entity_type": "PERSON | ORG | LOCATION | DATE | PRODUCT | OTHER",
  "start_char": "<integer | null>",
  "confidence": "high | medium | low"
}
Do not merge entities. Do not split a single entity across multiple objects.
If no entities are found, return an empty array [].
TEXT:
{{text}}
Document Parsing (invoices, contracts, reports)
Parse the document below into structured JSON.
Follow the schema exactly. Return null for any field not present in the document.
Do not infer or fabricate values.
SCHEMA:
{{paste_schema_here}}
RULES:
- Dates must be ISO 8601 (YYYY-MM-DD) or null
- Currency amounts must be numbers, not strings
- Arrays must always be present, even if empty []
- Return ONLY the JSON object. No markdown. No explanation.
DOCUMENT:
{{document}}
Table Reconstruction
Tables embedded in prose or OCR output are among the hardest extraction targets. The model needs to both identify the table structure and map it to your schema.
The text below contains a table. Reconstruct it as a JSON array of objects.
Rules:
- Each row becomes one object
- Column headers become keys (normalized to snake_case)
- Missing cell values become null
- Numeric strings become numbers where the column is clearly numeric
- Return ONLY the JSON array
If no table is found, return []
TEXT:
{{text}}
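The "normalized to snake_case" rule is worth implementing on your side as well, so you can compare the keys the model returns against the headers you expected. A sketch (`to_snake_case` is an illustrative helper):

```python
import re

def to_snake_case(header: str) -> str:
    """Normalize a column header the same way the prompt asks the model to,
    so post-hoc validation can compare expected vs. returned keys."""
    # Collapse any run of non-alphanumeric characters into one underscore.
    header = re.sub(r"[^0-9a-zA-Z]+", "_", header.strip())
    # Split camelCase boundaries: "unitPrice" -> "unit_Price".
    header = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", header)
    return header.strip("_").lower()
```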
Community practice confirms that forcing strict structural adherence - schema first, no conversational text - is essential for machine-readable output [4]. The templates above all follow this principle.
Putting It Together at Scale
At document scale, the practical stack looks like this. Schema-anchored Pass 1 runs on every document. A lightweight validator (Pydantic, Zod, JSON Schema) checks the output. Documents that fail validation get routed to the Pass 2 repair prompt. Documents that fail repair get flagged for human review with the _meta.missing_required_fields list attached.
This tiered approach - extract, validate, repair, escalate - handles the brittleness that benchmarks consistently identify as the core failure mode [3]. You're not trying to write a perfect prompt. You're building a system that degrades gracefully and surfaces its own failures.
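The tiered flow fits in a small dispatcher. A sketch using a bare required-keys check in place of a full validator; `call_llm(prompt) -> str` is a placeholder, and in practice you would swap `try_parse` for a Pydantic or JSON Schema validation step:

```python
import json

def try_parse(text: str, required: set[str]):
    """Parse and structurally validate; None means validation failed."""
    try:
        record = json.loads(text)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) and required <= record.keys() else None

def run_pipeline(document: str, call_llm, required: set[str]) -> dict:
    """Extract -> validate -> repair -> escalate, as described above."""
    raw = call_llm(f"PASS 1 EXTRACTION\nDOCUMENT:\n{document}")
    record = try_parse(raw, required)
    if record is not None:
        return {"status": "ok", "record": record}
    repaired = call_llm(f"PASS 2 REPAIR\nEXTRACTED JSON:\n{raw}")
    record = try_parse(repaired, required)
    if record is not None:
        return {"status": "repaired", "record": record}
    # Escalate: surface the failure for human review, never drop it silently.
    return {"status": "needs_review", "missing_required_fields": sorted(required)}
```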
If you want to speed up the prompt-writing part of this workflow, a tool like Rephrase can help you iterate on extraction prompts faster: it auto-improves prompts in any app via a global hotkey, which is genuinely useful when you're tuning schema anchors across ten different document types.
The extraction problem is mostly solved. The schema consistency problem is an engineering problem dressed up as a prompting problem. Treat it like one.
References
Documentation & Research
- [1] T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning - arXiv (arxiv.org/abs/2603.03790)
- [2] GraphScout: Empowering LLMs with Intrinsic Exploration Ability for Agentic Graph Reasoning - arXiv (arxiv.org/abs/2603.01410)
- [3] Benchmarking LLMs on Reference Extraction and Parsing in the Social Sciences and Humanities - arXiv (arxiv.org/abs/2603.13651)
Community Examples
- [4] The 'Taxonomy Architect' for Organizing Messy Data - r/PromptEngineering (reddit.com/r/PromptEngineering/comments/1rla5tq)
- [5] Show HN: Smelt - Extract Structured Data from PDFs and HTML Using LLM - GitHub (github.com/akdavidsson/smelt)

