Most prompt engineering advice assumes short, clean input. A question, a task, maybe a paragraph of context. Real documents don't work like that. A legal contract runs 60 pages. A 10-K report is 200. A research PDF includes footnotes that contradict the abstract.
When your input is a structured document and your output needs to be precise, everything - chunk boundaries, retrieval order, prompt framing - affects quality in ways that casual prompting ignores.
Here's how to get it right.
Key Takeaways
- Chunk by semantic unit (section, table, clause), not by token count
- Retrieval order matters: place the most relevant chunks at the start or end of context, not the middle
- Write extraction prompts that carry document structure as explicit metadata
- Always instruct the model to surface contradictions rather than silently resolve them
- Summarize for overview tasks; chunk-and-retrieve for precision tasks
Why Long Documents Break Normal Prompting
The fundamental problem is that fitting a document into a context window isn't the same as reasoning well over it. Most production LLMs can technically hold the whole document, but attention mechanisms in transformer-based models don't weight all positions equally. Content buried in the middle of a long context tends to get less reliable treatment than content at the edges. This is sometimes called the "lost in the middle" problem, and it directly affects output quality on documents like annual reports, where critical disclosures appear in section 7 of 12.
There's also a structural problem. Legal contracts and financial reports are designed for humans who can navigate a table of contents, jump to a specific clause, and hold cross-references in their head. When you dump the full text into a prompt, you strip that navigational structure. The model sees a wall of text, not a document.
Anthropic's prompt engineering documentation [1] recommends giving models explicit structural context - what the document is, what role it plays, and what you want extracted - rather than assuming the model will infer purpose from raw text alone. OpenAI's guidance [2] emphasizes similar principles: be explicit about format, role, and expected output shape before the model encounters the substantive content.
Both align on a core principle: structure in, structure out.
Split vs. Summarize: When to Use Each
The first decision is architectural. You're either chunking the document and querying chunks, or you're summarizing sections first and querying the summaries. These aren't interchangeable.
Chunking preserves detail. Use it when you need exact clause language, specific numbers, named parties, or any information where approximation is unacceptable. Contract review is the canonical case. "Does this agreement include a non-compete clause, and what is its scope?" requires the actual text of that clause, not a summary that might paraphrase away a carve-out.
Summarization trades precision for breadth. Use it when you need to understand a document's overall structure, or when you're comparing themes across multiple documents. Generating an executive summary of a 10-K, or comparing the risk sections of five different vendor contracts, works well with summarization passes.
A hybrid approach - summarize each section to build a map, then retrieve full-text chunks for specific queries - is the most robust for complex workflows. Recent work on financial document parsing demonstrates that building a globally consistent table of contents from the document's own heading hierarchy dramatically improves structure-aware retrieval [3]. You're essentially giving the model a GPS before asking it to navigate.
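A minimal sketch of the hybrid idea, assuming you already have one summary per section. Plain word overlap stands in for whatever retriever you actually use (embeddings, BM25), and the function name, section titles, and summaries are all hypothetical:

```python
def pick_sections(query: str, summaries: dict[str, str], top_k: int = 2) -> list[str]:
    """Rank sections by word overlap between the query and each summary."""
    query_words = set(query.lower().split())
    scored = []
    for title, summary in summaries.items():
        overlap = len(query_words & set(summary.lower().split()))
        scored.append((overlap, title))
    # Highest overlap first; ties broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for overlap, title in scored[:top_k] if overlap > 0]


# Hypothetical section map: short summaries for routing, full text for answering.
summaries = {
    "8. Limitation of Liability": "caps liability at fees paid and lists carve-outs",
    "12. Notices": "notice period and delivery methods",
    "3. Fees": "fees invoicing and payment terms",
}
full_text = {
    "8. Limitation of Liability": "8.1 In no event shall either party be liable...",
    "12. Notices": "12.1 Notices must be delivered in writing...",
    "3. Fees": "3.1 Fees are invoiced monthly...",
}

selected = pick_sections("what is the liability cap", summaries)
context = [full_text[title] for title in selected]
```

The summaries are only used to decide where to look. The model still answers from the full-text chunk, so nothing is paraphrased away.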
Chunking Strategies That Preserve Structure
Fixed-size chunking is the default in most RAG tutorials, and it's wrong for structured documents. Splitting every 512 tokens means a table ends mid-row, a clause gets cut from its header, and a footnote reference loses its anchor. The model gets fragments, not units.
Chunk by logical boundary instead. For contracts: definitions section, representations and warranties, covenants, termination provisions, schedules. For financial reports: narrative MD&A sections, individual financial statement tables, footnotes as standalone units. For research PDFs: abstract, each named section, figures with captions, references.
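As a rough sketch of chunking by logical boundary, the snippet below splits contract text at section headings. It assumes headings look like "SECTION 8. LIMITATION OF LIABILITY" on their own line; the pattern, the `Chunk` class, and the sample text are illustrative assumptions, and real documents will need their own heading rules:

```python
import re
from dataclasses import dataclass

# Assumed heading format: "SECTION <number>. <TITLE>" on its own line.
HEADING_RE = re.compile(r"^SECTION\s+(\d+)\.\s+(.+)$", re.MULTILINE)


@dataclass
class Chunk:
    number: str
    title: str
    body: str


def chunk_by_section(text: str) -> list[Chunk]:
    """Split text at section headings so each chunk is one whole section.

    Text before the first heading is ignored in this sketch.
    """
    matches = list(HEADING_RE.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        # A section runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        chunks.append(Chunk(match.group(1), match.group(2).strip(), body))
    return chunks


sample = (
    "SECTION 8. LIMITATION OF LIABILITY\n"
    "8.1 In no event shall either party be liable...\n"
    "SECTION 9. INDEMNIFICATION\n"
    "9.1 Each party shall indemnify the other...\n"
)
chunks = chunk_by_section(sample)
```

Chunk sizes will vary widely with this approach. That is the intended trade: a long clause stays whole instead of being cut at an arbitrary token count.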
Each chunk should carry a metadata header. Not as a note to yourself - as explicit content the model reads:
[Document: Master Services Agreement - Acme Corp / Vendor Ltd]
[Section: 8. Limitation of Liability]
[Pages: 14-15]
[Related sections: 9. Indemnification, Schedule B]
SECTION 8. LIMITATION OF LIABILITY
8.1 In no event shall either party be liable...
That header costs tokens. It's worth every one. It gives the model orientation, prevents it from treating the chunk as a standalone document, and makes it possible to produce citations that actually point somewhere useful.
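One way to attach that header programmatically is sketched below. The function name and parameters are hypothetical; the bracketed layout simply mirrors the example above:

```python
def with_header(document: str, section: str, pages: str,
                related: list[str], body: str) -> str:
    """Prefix a chunk with a metadata header the model will read as content."""
    lines = [
        f"[Document: {document}]",
        f"[Section: {section}]",
        f"[Pages: {pages}]",
    ]
    # Only emit the cross-reference line when there is something to list.
    if related:
        lines.append(f"[Related sections: {', '.join(related)}]")
    return "\n".join(lines) + "\n" + body


chunk_text = with_header(
    document="Master Services Agreement - Acme Corp / Vendor Ltd",
    section="8. Limitation of Liability",
    pages="14-15",
    related=["9. Indemnification", "Schedule B"],
    body="SECTION 8. LIMITATION OF LIABILITY\n8.1 In no event shall either party be liable...",
)
```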
Writing Extraction Prompts That Work
A bad extraction prompt asks for everything and specifies nothing:
Summarize the key points of this contract section.
A good extraction prompt defines the target, the format, and what to do with ambiguity:
You are reviewing Section 8 of a Master Services Agreement.
Extract the following, using exact quoted language where possible:
1. The liability cap amount or formula
2. Whether the cap applies to both parties or only one
3. Any carve-outs from the cap (e.g., gross negligence, IP infringement)
4. The specific clause reference (e.g., "8.1", "8.2(b)")
If a field is not present in this section, write "Not found in this section."
Do not infer or paraphrase if you cannot find explicit text.
Output as a JSON object.
The instruction "Do not infer or paraphrase if you cannot find explicit text" is critical for legal and financial work. Models are trained to be helpful, which creates a pull toward plausible-sounding guesses. You need to override that explicitly.
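The prompt alone is not a guarantee, so it can help to check the model's JSON before trusting it. The sketch below assumes a hypothetical output shape in which each found field is an object with a verbatim `quote`; the field names and sample text are made up for illustration:

```python
import json

NOT_FOUND = "Not found in this section."
# Hypothetical field names matching the four items in the prompt above.
REQUIRED = ["cap", "applies_to", "carve_outs", "clause_reference"]


def check_extraction(raw_json: str, source_text: str) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    data = json.loads(raw_json)
    problems = []
    for field in REQUIRED:
        if field not in data:
            problems.append(f"missing field: {field}")
            continue
        value = data[field]
        if value == NOT_FOUND:
            continue
        # Assumption: each found field carries a verbatim "quote" from the section.
        quote = value.get("quote", "") if isinstance(value, dict) else ""
        if not quote or quote not in source_text:
            problems.append(f"unverified quote: {field}")
    return problems


# Made-up section text and model output for illustration.
source = ("8.1 In no event shall either party be liable for more than "
          "the fees paid in the prior 12 months.")
model_output = json.dumps({
    "cap": {"quote": "the fees paid in the prior 12 months"},
    "applies_to": {"quote": "either party"},
    "carve_outs": NOT_FOUND,
    "clause_reference": {"quote": "8.3"},
})
problems = check_extraction(model_output, source)
```

Here the check flags `clause_reference`, because "8.3" never appears in the source text. A substring match is a blunt instrument, but it catches the most damaging failure: quoted language that does not exist in the document.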
Here's a real before/after for a financial report extraction task:
Before:
What does this section say about revenue?
After:
You are analyzing the Revenue Recognition section (Note 3) of a publicly filed 10-K.
Extract:
- Total reported revenue for fiscal year 2025
- Revenue breakdown by segment, if provided
- Any changes to revenue recognition methodology compared to prior year
- Any material uncertainties flagged by management
Cite the specific paragraph or table for each finding.
If figures appear in both the narrative and a table and they differ, flag the discrepancy.
The second prompt produces auditable output. The first produces a paragraph you can't verify.
Handling Conflicting Information Across Sections
This is where most document prompting workflows quietly fail. A contract's definition section says "Business Day means any day except Saturday, Sunday, or a federal holiday." Section 12 references a notice period of "5 Business Days." An exhibit added during negotiation defines Business Day differently, including Saturdays.
If you ask the model for the notice period without telling it to flag that conflict, it will give you a confident answer that might be wrong, depending on which definition governs.
The fix is an explicit contradiction instruction in every extraction prompt that spans multiple sections:
If you find a term defined differently in different sections or exhibits,
do not pick one definition. List both definitions, identify where each appears,
and flag this as a conflict requiring review.
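The same rule can be enforced as a post-processing check once definitions have been extracted per section. A minimal sketch, assuming extracted definitions arrive as `(term, location, text)` tuples (a hypothetical shape); it compares text only after normalizing whitespace and case, so it flags differences for review rather than judging which definition governs:

```python
from collections import defaultdict


def find_definition_conflicts(definitions):
    """Group definitions by term and return the terms defined more than one way.

    definitions: list of (term, location, text) tuples.
    """
    by_term = defaultdict(list)
    for term, location, text in definitions:
        by_term[term.lower()].append((location, text))

    conflicts = {}
    for term, entries in by_term.items():
        # Normalize whitespace and case so formatting noise is not a conflict.
        distinct = {" ".join(text.split()).lower() for _, text in entries}
        if len(distinct) > 1:
            conflicts[term] = entries
    return conflicts


extracted = [
    ("Business Day", "Section 1",
     "any day except Saturday, Sunday, or a federal holiday"),
    ("Business Day", "Exhibit C", "any day except Sunday or a federal holiday"),
    ("Affiliate", "Section 1", "any entity under common control"),
]
conflicts = find_definition_conflicts(extracted)
```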
For cross-document comparison - say, reviewing five vendor NDAs to identify which ones have survival clauses - the same principle applies at document level. A comparison prompt should include:
For each document, extract the survival clause language verbatim.
If documents use inconsistent terminology for the same concept
(e.g., "surviving obligations" vs. "post-termination obligations"),
note the terminology difference.
Do not normalize language across documents.
Preserving the variation is the point. Normalizing it away is how legal risk gets missed.
Retrieval Order and the Middle Problem
If you're building a RAG pipeline over document chunks, the order in which chunks appear in the context window affects output quality independent of their content. Research on retrieval-augmented generation suggests that information placed in roughly the first 20% or last 20% of a long context window is attended to more reliably than content in the middle [4].
The practical implication: when assembling a context window from retrieved chunks, put your highest-relevance chunk first, your second-highest last, and fill the middle with supporting context. If you're working with a model that has strong recency bias, swap them so the most critical chunk comes last, even if it means breaking strict document order.
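The first-and-last placement can be sketched in a few lines. The function name is hypothetical, and the relevance scores are assumed to come from whatever retriever you already use:

```python
def order_for_context(scored_chunks):
    """Place the top chunk first, the runner-up last, and the rest in between.

    scored_chunks: list of (chunk_text, relevance_score) pairs.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    if len(ranked) < 3:
        # With one or two chunks, ranked order already puts them at the edges.
        return [text for text, _ in ranked]
    first, last, middle = ranked[0], ranked[1], ranked[2:]
    return [first[0]] + [text for text, _ in middle] + [last[0]]


ordered = order_for_context([("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)])
```

In this example `ordered` is `["b", "c", "a", "d"]`: the 0.9 chunk leads, the 0.7 chunk closes, and the weaker chunks sit in the middle.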
For document analysis specifically, also consider putting the extraction instruction and output schema at the end of the prompt, after the document content. Anthropic's documentation recommends this placement for complex extraction tasks - it keeps the instruction fresh when the model begins generating [1].
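A small sketch of that layout, with the document content first and the instruction and schema last. The `<document>` wrapper tags, the function name, and the sample strings are assumptions for illustration, not a required format:

```python
def build_prompt(chunks: list[str], instruction: str, schema: str) -> str:
    """Assemble a prompt with document content first and the task last."""
    document_block = "\n\n".join(chunks)
    return (
        "<document>\n" + document_block + "\n</document>\n\n"
        + instruction.strip() + "\n\n"
        + "Output schema:\n" + schema.strip()
    )


prompt = build_prompt(
    chunks=["[Section: 8. Limitation of Liability]\n8.1 In no event shall either party be liable..."],
    instruction="Extract the liability cap amount or formula.",
    schema='{"cap": "string"}',
)
```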
Worked Example: Contract Review Pipeline
Here's a compact architecture for a clause-extraction pipeline:
| Step | What happens | Prompt strategy |
|---|---|---|
| Parse | Convert PDF to structured text, preserve headings | Semantic chunking by section |
| Map | Build a section index with page references | Ask model to output TOC as JSON |
| Extract | Run targeted extraction per clause type | Per-section prompts with JSON schema |
| Reconcile | Compare findings across sections/exhibits | Contradiction-flagging prompt |
| Output | Structured report with citations | Format as Markdown table with clause refs |
The reconciliation step is the one most pipelines skip. It's also the one where production document review workflows catch the most errors - both in human review and in model-assisted pipelines [4].
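To make the flow concrete, here is a compressed sketch of the map, extract, and reconcile steps wired together. The model call is injected as a plain function so the example runs with a stub; `run_pipeline`, `fake_model`, the prompt wording, and the sample sections are all hypothetical, and the map step is simplified to reuse headings the parser already recovered:

```python
import json
from typing import Callable


def run_pipeline(sections: dict[str, str], call_model: Callable[[str], str]) -> dict:
    # Map: a section index (simplified here to the parsed section titles).
    index = list(sections)

    # Extract: one targeted prompt per section, expecting JSON back.
    findings = {}
    for title, body in sections.items():
        prompt = (
            f"[Section: {title}]\n{body}\n\n"
            "Extract defined terms as a JSON object mapping term to definition."
        )
        findings[title] = json.loads(call_model(prompt))

    # Reconcile: flag any term whose definition differs between sections.
    seen = {}
    conflicts = []
    for title, terms in findings.items():
        for term, definition in terms.items():
            if term in seen and seen[term][1] != definition:
                conflicts.append({"term": term, "locations": [seen[term][0], title]})
            seen.setdefault(term, (title, definition))

    return {"index": index, "findings": findings, "conflicts": conflicts}


def fake_model(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    if "[Section: Exhibit C]" in prompt:
        return json.dumps({"Business Day": "any day except Sunday"})
    return json.dumps(
        {"Business Day": "any day except Saturday, Sunday, or a federal holiday"}
    )


report = run_pipeline(
    {
        "1. Definitions": '"Business Day" means any day except Saturday, Sunday, or a federal holiday.',
        "Exhibit C": '"Business Day" means any day except Sunday.',
    },
    fake_model,
)
```

Passing the model call in as a parameter keeps the reconciliation logic testable without network access, which is useful given that this is the step most likely to be skipped.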
Worked Example: Cross-Document Comparison
Comparing multiple research PDFs for a literature review, or five vendor contracts for procurement, follows the same logic at a higher level. Process each document individually first, extract to a consistent schema, then run a comparison pass over the structured outputs - not over raw document text.
If you try to stuff three 50-page reports into a single context and ask "what do these agree on?", you're back to the lost-in-the-middle problem at scale. Process in parallel, structure the outputs, then compare the structures.
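A sketch of the comparison pass over structured outputs, assuming each document was already extracted to the same hypothetical schema with a `survival_clause` field holding verbatim text or `None`. The document names and clause text are invented for illustration:

```python
def compare_survival_clauses(extractions: dict[str, dict]) -> dict:
    """Compare per-document extractions without touching the raw documents.

    extractions: document name -> {"survival_clause": verbatim text or None}.
    """
    with_clause = {}
    without_clause = []
    for name, fields in extractions.items():
        clause = fields.get("survival_clause")
        if clause:
            with_clause[name] = clause  # kept verbatim, not normalized
        else:
            without_clause.append(name)
    return {"with_clause": with_clause, "without_clause": sorted(without_clause)}


comparison = compare_survival_clauses({
    "Vendor A NDA": {"survival_clause": "Sections 4 and 7 are surviving obligations."},
    "Vendor B NDA": {"survival_clause": None},
    "Vendor C NDA": {"survival_clause": "Post-termination obligations continue for 3 years."},
})
```

Because the clause text is carried through verbatim, the terminology differences between documents remain visible in the comparison output.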
A tool like Rephrase can help when you're iterating on extraction prompt drafts across multiple document types - quickly rewriting a prompt template for contracts vs. research papers vs. financial reports without manually restructuring from scratch each time.
Getting the Basics Right First
Before worrying about retrieval order or contradiction handling, make sure your document parsing is solid. Research on financial PDF extraction shows that cross-page structural discontinuities - a table that spans two pages, a section heading orphaned from its body - are a primary failure mode before you even reach the prompting layer [3]. If your OCR or PDF-to-text conversion is breaking document structure, no prompt strategy fixes that downstream.
Parse first. Chunk correctly. Then prompt precisely.
The LLM is the last step in a pipeline, not the first. Treat your document processing with the same engineering rigor you'd apply to a data pipeline - because that's exactly what it is.
References
Documentation & Research
- [1] Prompt Engineering Overview - Anthropic (docs.anthropic.com)
- [2] Prompt Engineering Guide - OpenAI (platform.openai.com)
- [3] Agentar-Fin-OCR: Document Parsing for Financial PDFs - Qian et al., Ant Group (arxiv.org/abs/2603.11044)
- [4] IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation (arxiv.org/abs/2602.23481)
Community Examples
- Legal automation pipeline discussion - r/PromptEngineering (reddit.com)