

prompt engineering•March 24, 2026•9 min read


# Multi-Modal Prompting: GPT-5, Gemini 3, Claude 4

Most prompt engineering advice is written for a clean, text-only world. But your actual workflow probably isn't that clean. You're feeding in a screenshot, a PDF, a voice transcript, and a system instruction - all in the same chain - and wondering why the output keeps going sideways.

Multi-modal prompting is genuinely different from single-modality work, and the differences aren't just cosmetic. The failure modes are different. The structuring rules are different. And the decision of whether to combine inputs or split them across steps has real consequences for cost, latency, and output quality.

Here's what actually works in 2026.

## Key Takeaways

- Modality order inside a prompt matters: anchor with text, then attach media inputs
- "Combine vs. split" is a dependency question, not a preference question
- Silent truncation and modality bleed are the two failure modes you won't catch in single-modality testing
- GPT-5, Gemini 3, and Claude 4 have meaningfully different behaviors for mixed-input prompts
- Reusable templates with typed slots reduce format drift across long chains

## Why Multi-Modal Pipelines Break Differently

When a text-only prompt fails, the failure is usually visible - the output is wrong, incomplete, or off-topic. Multi-modal failures are sneakier. Research on adaptive tool orchestration frameworks shows that non-text modality paths require explicit decomposition strategies because models don't naturally separate what they "saw" from what they "read" when both inputs are present [3]. That blurring is what I call **modality bleed**: the model's analysis of an image leaks into its interpretation of an accompanying document, or vice versa.

The second failure mode is **silent truncation**. Long PDFs attached to a prompt rarely throw an error when they exceed the model's processing capacity - they just get quietly cut off, and the model reasons over an incomplete document without telling you. This is especially dangerous in document-plus-image workflows where you assume both inputs were fully processed.

Both of these fail silently. That's the core problem.

## The Split vs. Combine Decision Framework

Before you write a single line of a multi-modal prompt, answer one question: does the model need to see all inputs simultaneously to reason correctly, or can it process them independently?

If the answer is "simultaneously," combine them. If the answer is "independently," split them.

Here's the framework as a practical table:

| Scenario | Combine or Split | Reason |
|---|---|---|
| Image + text where image IS the subject | Combine | Model needs visual context to interpret the text question |
| PDF summary + follow-up Q&A | Split | Summarize first, then query the summary |
| Audio transcript + sentiment analysis | Split | Transcribe first, analyze text output |
| Screenshot + bug report | Combine | Visual and textual context are co-dependent |
| Multiple documents + cross-reference task | Split into chunks, then combine | Avoids silent truncation; merge summaries in final step |
| Voice memo + calendar data + scheduling task | Split then combine | Process each source, synthesize in final prompt |

The underlying logic comes from how distributed pipeline schedulers think about workflow graphs [1]: when components have shared data dependencies, they need to run in the same stage. When they don't, parallelizing or sequencing them separately is almost always more efficient and more debuggable.
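The same dependency logic can be expressed in a few lines of Python. This is a minimal sketch, not part of any library - `ModalInput` and `plan_stages` are illustrative names - where inputs that must be reasoned over jointly collapse into one combined stage, and everything else becomes its own split stage:

```python
from dataclasses import dataclass, field

@dataclass
class ModalInput:
    name: str
    # names of other inputs this one must be reasoned over jointly with
    joint_with: set = field(default_factory=set)

def plan_stages(inputs):
    """Group co-dependent inputs into one combined stage each;
    independent inputs become their own split stage."""
    by_name = {i.name: i for i in inputs}
    # build an undirected dependency graph
    adj = {i.name: set(i.joint_with) for i in inputs}
    for a in inputs:
        for b in a.joint_with:
            adj.setdefault(b, set()).add(a.name)
    seen, stages = set(), []
    for name in by_name:
        if name in seen:
            continue
        # collect the connected component containing this input
        stack, component = [name], []
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            component.append(n)
            stack.extend(adj.get(n, ()))
        stages.append(sorted(component))
    return stages
```

Feeding in a screenshot that depends on a bug report, plus an unrelated PDF, yields two stages: one combined (screenshot + bug report) and one split (the PDF) - exactly the table's screenshot-plus-bug-report row.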

## Structuring Multi-Modal Prompts: The Template Pattern

Regardless of which model you're using, multi-modal prompts benefit from a consistent slot-based structure. Think of it as typed inputs - you declare what each piece is before the model processes it. This reduces format drift significantly in multi-step chains.

Here's the base template:

[CONTEXT] You are a [role]. Your task is to [task description].

[INPUT: TEXT] {text_content}

[INPUT: IMAGE] {image_or_image_url}
Description hint: {optional_caption_or_label}

[INPUT: DOCUMENT] {document_content_or_extracted_text}

[TASK] Using the inputs above, [specific instruction].

[OUTPUT FORMAT] Return your response as [JSON / markdown / plain text] with the following fields:

  • field_1: [description]
  • field_2: [description]

The `[INPUT: TYPE]` labels are not just for readability. They act as soft anchors that help the model keep modalities conceptually separate. In testing, removing these labels increases modality bleed errors noticeably - especially on Claude 4, which is sensitive to structural cues in the prompt.
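If you assemble these prompts programmatically, the typed slots are easy to generate. Here is a hypothetical helper (not part of any SDK) that builds the template above from a list of `(modality_label, content)` pairs:

```python
def build_prompt(role, task, inputs, instruction, output_format):
    """Assemble a multi-modal prompt with typed [INPUT: ...] slots.
    `inputs` is a list of (modality_label, content) pairs."""
    parts = [f"[CONTEXT] You are a {role}. Your task is to {task}."]
    for label, content in inputs:
        parts.append(f"[INPUT: {label.upper()}] {content}")
    parts.append(f"[TASK] Using the inputs above, {instruction}")
    parts.append(f"[OUTPUT FORMAT] {output_format}")
    return "\n\n".join(parts)
```

Because every input passes through the same labeled slot, the structure stays identical across steps of a chain, which is what keeps format drift down.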

## Model-Specific Behavior: GPT-5, Gemini 3, Claude 4

These three models handle multi-modal inputs differently enough that you should adapt your template per model. Here's what I've found in practice:

### GPT-5

GPT-5 handles interleaved image-text well - you can alternate between image references and text instructions without major degradation. The catch is output format consistency. When you mix modalities, GPT-5 tends to produce more verbose, conversational outputs unless you include an explicit output format block. Always end multi-modal GPT-5 prompts with a strict format instruction. JSON schema hints work better than prose descriptions.

[OUTPUT FORMAT] Respond only with valid JSON matching this schema: {"finding": string, "confidence": "high" | "medium" | "low", "source_modality": string}


### Gemini 3

Gemini 3's long context window is its biggest advantage for multi-modal work. It can genuinely process long PDFs alongside images without truncating either, which makes it the right choice for document-heavy pipelines. The failure mode to watch for here is **instruction drift** in very long prompts - task instructions placed early in the prompt can get de-weighted when the document fills the context. Put your task instructions at the end, not the beginning.

[DOCUMENT] {full_pdf_extracted_text}

[IMAGE] {image}

[TASK - READ THIS LAST, EXECUTE FIRST] Summarize the discrepancies between the document data and the image visualization. Return three bullet points maximum.


### Claude 4

Claude 4 is the strongest model for structured document parsing. It respects schema instructions reliably and handles multi-document inputs well. Its weakness is audio-adjacent tasks - if you're feeding in transcripts, you need to explicitly label them as transcripts (not just paste the text), or Claude will treat them as prose and miss speaker-dependent context.

[INPUT: TRANSCRIPT]
Source: Auto-generated speech-to-text from a customer call recording
Speaker labels: AGENT, CUSTOMER
{transcript_content}

[TASK] Identify the top two customer complaints and classify each by sentiment.


## Handoff Patterns Between Modalities

In multi-step chains, the output of one modality step becomes the input of the next. This handoff is where pipelines most commonly degrade. Research on real-time multi-modal serving confirms that managing the handoff between language, audio, and visual generation stages - each with different resource and latency profiles - is the primary engineering challenge in production systems [2].

For prompting purposes, the practical equivalent is making sure the output format of step N is explicitly compatible with the input format of step N+1. Don't rely on the model to infer this.

Here's a concrete handoff example - audio transcript to structured analysis:

**Step 1: Transcription prompt output**

{ "transcript": "...", "speakers": ["AGENT", "CUSTOMER"], "duration_seconds": 247 }


**Step 2: Analysis prompt input**

[INPUT: STRUCTURED TRANSCRIPT] The following is a JSON object from a previous transcription step. Parse the "transcript" field and the "speakers" field to complete your task.

{paste Step 1 output here}

[TASK] Identify all unresolved customer issues. List each with the speaker turn where it was raised.


Explicitly naming the source ("from a previous transcription step") primes the model to treat the input as a structured artifact rather than freeform text. This small framing choice reduces misinterpretation errors significantly.
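In code, the handoff is just string assembly, but making the provenance explicit is the point. A sketch (the function name is hypothetical), assuming Step 1 returned a JSON object like the one above:

```python
import json

def make_handoff_prompt(step1_output, task):
    """Embed a prior step's structured output in a labeled input block,
    naming its source so the model treats it as an artifact, not prose."""
    payload = json.dumps(step1_output, indent=2)
    return (
        '[INPUT: STRUCTURED TRANSCRIPT] The following is a JSON object '
        'from a previous transcription step. Parse the "transcript" and '
        '"speakers" fields to complete your task.\n\n'
        f'{payload}\n\n'
        f'[TASK] {task}'
    )
```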

The "one supervisor, many modalities" architecture described in recent orchestration research formalizes this pattern at a systems level - a central agent decomposes the task, routes each modality to the right tool, then synthesizes outputs [3]. In manual prompting, you're doing this decomposition yourself, which means being explicit about it in each prompt is the only way to maintain coherence across steps.

## Reducing Format Drift Over Long Chains

The longer your chain, the more the output format degrades. Each model call introduces small variations in how it structures its response, and these variations compound. By step 5 of a 6-step chain, what should be clean structured JSON often arrives with prose mixed in.

Two techniques help. First, include your output schema in every step, not just the first. Yes, it adds tokens. It's worth it. Second, use a validation step - a cheap, fast model call that checks whether the previous output matches the expected schema before passing it downstream. This is essentially what schema-gated workflow approaches do for scientific pipelines [4], and the same principle applies here.
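The validation gate doesn't even need a model call for basic checks - the Python standard library is enough. A minimal sketch, assuming each step emits a JSON array of flat objects (the schema mirrors the discrepancy example; adapt the fields to your own pipeline):

```python
import json

# expected shape of each step's output: field name -> allowed type(s)
SCHEMA = {"region": str, "chart_value": (int, float),
          "doc_value": (int, float), "delta": (int, float)}

def validate_step_output(raw, schema):
    """Parse a model response and check it against the schema before
    passing it downstream. Raises ValueError with a reason on failure."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"output is not valid JSON: {e}")
    if not isinstance(items, list):
        raise ValueError("expected a JSON array")
    for i, item in enumerate(items):
        for name, types in schema.items():
            if name not in item:
                raise ValueError(f"item {i} missing field '{name}'")
            if not isinstance(item[name], types):
                raise ValueError(f"item {i} field '{name}' has wrong type")
    return items
```

Run this between steps; on a `ValueError`, re-prompt the previous step with the error message instead of letting a malformed artifact propagate downstream.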

If you're iterating on multi-modal prompts regularly across different tools and apps, [Rephrase](https://rephrase-it.com) can auto-detect the modality context of what you're working on and rewrite your prompt to match the expected input structure for the target model - which cuts the iteration loop down from minutes to seconds.

## Before and After: Multi-Modal Prompt Transformation

**Before (typical first attempt):**

Look at this image and the attached PDF and tell me what's wrong with the data.


**After (structured multi-modal prompt):**

[CONTEXT] You are a data analyst reviewing a quarterly report for inconsistencies.

[INPUT: IMAGE] {chart_screenshot}
Description hint: Bar chart showing Q1-Q4 revenue by region, from the slide deck.

[INPUT: DOCUMENT] {extracted_pdf_text}
Source: Q4 financial report, pages 4-7 only.

[TASK] Identify any discrepancies between the chart values and the figures in the document. List each discrepancy as: {region}, {chart_value}, {document_value}, {delta}.

[OUTPUT FORMAT] Return a JSON array. Each item: {"region": string, "chart_value": number, "doc_value": number, "delta": number}


The difference isn't complexity - it's structure. Labeled inputs, bounded document scope, explicit output schema. That's the whole pattern.

Multi-modal prompting rewards the same discipline that good API design rewards: explicit contracts between components, typed inputs, and no assumptions about what the model will infer. Get that right and the modality combination becomes almost irrelevant. Get it wrong and you'll be debugging failures that only appear when two input types are present simultaneously.

For more on prompt structuring techniques, browse the [Rephrase blog](https://rephrase-it.com/blog).

---

## References

**Documentation & Research**

1. WORKSWORLD: A Domain for Integrated Numeric Planning and Scheduling of Distributed Pipelined Workflows - Taylor Paul, William Regli, University of Maryland ([arxiv.org](https://arxiv.org/abs/2603.12214v1))
2. StreamWise: Serving Multi-Modal Generation in Real-Time at Scale - Zhang et al., Microsoft Azure Research ([arxiv.org](https://arxiv.org/abs/2603.05800))
3. One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries - Saini & Bishwas, PwC US ([arxiv.org](https://arxiv.org/abs/2603.11545))
4. Talk Freely, Execute Strictly: Schema-Gated Agentic AI for Flexible and Reproducible Scientific Workflows ([arxiv.org](https://arxiv.org/abs/2603.06394))

Ilia Ilinskii

Founder of Rephrase-it. Building tools to help humans communicate with AI.

Frequently Asked Questions

**What is a multi-modal prompt, and why does structure matter?**
A multi-modal prompt combines more than one input type - text, images, audio, or documents - in a single request to an AI model. Structuring these inputs correctly is critical because each modality has different context windows, token costs, and failure modes.

**How do GPT-5, Gemini 3, and Claude 4 differ on multi-modal inputs?**
GPT-5 handles interleaved image-text well but requires explicit output format instructions when mixing modalities. Gemini 3 has the largest native context window and excels at long document plus image tasks. Claude 4 is strong at structured document parsing and produces more predictable output schemas.

**Can tools automate multi-modal prompt structuring?**
Yes. Tools like Rephrase can detect the modality context of what you're working on and rewrite your prompt to match the expected input structure for the target model, saving significant iteration time.

