Prompting Text AI vs Image AI: Totally Different Games

The fundamental differences between prompts for LLMs and generative AI for images and video.

Ilia Ilinskii
Rephrase · Dec 23, 2025

Here's something that tripped me up when I started generating images. I thought writing prompts was writing prompts. Same skill, different tool. Nope.

Prompts for ChatGPT and prompts for Midjourney require completely different thinking. One wants structure and instructions. The other wants descriptions and vibes. Once I understood this, my results got way better.

Let me break it down.

Two Different Worlds

Language Models (LLMs)

These are trained on text to generate, analyze, and process text.

Examples: ChatGPT, Claude, Gemini, Llama, Mistral

Typical tasks: writing, analysis, code, Q&A, summarization

Generative Models (Images, Video)

These create visual content from text descriptions.

For images: DALL-E 3, Midjourney V7, Stable Diffusion 3.5, Flux, Ideogram, GPT-4o Image, Nano Banana

For video: Sora 2, Veo 3, Runway Gen-4, Kling 2.6, Pika, Luma

The Core Difference

LLM Prompts: Instructions and Structure

Element	Purpose
Role	Sets expertise and style
Instructions	Step-by-step what to do
Context	Background info and data
Constraints	What NOT to do
Output format	Structure of the response

LLM prompt example:

<role>You are a senior marketing analyst</role>
<instructions>
Analyze the campaign data and provide:
1. Key metrics summary
2. Performance trends
3. Recommendations
</instructions>
<constraints>
- Use only provided data
- Be concise (max 500 words)
</constraints>
<data>{{CAMPAIGN_DATA}}</data>
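
If you reuse a structure like this across many runs, it can help to assemble it in code instead of by hand. Here's a minimal Python sketch, assuming a hypothetical helper called build_llm_prompt. The tag names simply mirror the example above, and nothing here is tied to any particular model's API:

```python
def build_llm_prompt(role: str, task: str, steps: list[str],
                     constraints: list[str], data: str) -> str:
    """Assemble an XML-tagged prompt from its parts (hypothetical helper)."""
    # Numbered steps go inside <instructions>, dash bullets inside <constraints>.
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    bullets = "\n".join(f"- {rule}" for rule in constraints)
    return "\n".join([
        f"<role>{role}</role>",
        "<instructions>",
        task,
        numbered,
        "</instructions>",
        "<constraints>",
        bullets,
        "</constraints>",
        f"<data>{data}</data>",
    ])


prompt = build_llm_prompt(
    role="You are a senior marketing analyst",
    task="Analyze the campaign data and provide:",
    steps=["Key metrics summary", "Performance trends", "Recommendations"],
    constraints=["Use only provided data", "Be concise (max 500 words)"],
    data="{{CAMPAIGN_DATA}}",  # placeholder, swapped for real data before sending
)
print(prompt)
```

The point of the helper is only that each element from the table has its own slot, so a missing role or constraint is easy to spot.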

Image Prompts: Descriptions and Visual Attributes

Element	Purpose
Style	Artistic aesthetic
Subject	Main object/character
Setting	Environment and context
Lighting	Light sources and mood
Composition	Angle and framing
Technical	Resolution, aspect ratio

Image prompt example:

A photorealistic portrait of an elderly Japanese ceramicist
with deep, sun-etched wrinkles and a warm, knowing smile.
Natural window light from the left, shallow depth of field,
neutral background. Serene and masterful mood.

See the difference? One is giving orders. The other is painting a picture with words.

Quick Comparison

Aspect	LLMs	Image/Video Models
Format	Structured (XML/Markdown)	Descriptive text
Keywords vs sentences	Full sentences	Depends on model*
Negative instructions	`<constraints>` tags	Negative prompts
Iteration	Dialogue and refinement	Rerolls and variations
Examples	Text examples	Reference images
Length control	Specified in instructions	Not applicable
Style control	Tone and format	Artistic aesthetic

*Modern models such as Midjourney V6+, Flux, and Nano Banana prefer full descriptive sentences over keyword lists.

Writing Image Prompts

Basic Structure

[Style/Aesthetic] + [Subject] + [Setting] + [Lighting] + [Composition] + [Technical]
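
One way to keep yourself honest about that structure is to treat each slot as a field and render them into sentences. This is a rough Python sketch, not a feature of any image tool. The ImagePrompt class and its field names are my own assumptions, chosen to match the slots in the formula:

```python
from dataclasses import dataclass


@dataclass
class ImagePrompt:
    """Hypothetical container with one field per slot in the formula."""
    style: str
    subject: str
    setting: str = ""
    lighting: str = ""
    composition: str = ""
    technical: str = ""

    def render(self) -> str:
        # Style, subject and setting form the opening sentence.
        opening = f"{self.style} of {self.subject}"
        if self.setting:
            opening += f" {self.setting}"
        # Remaining non-empty slots each become their own short sentence.
        details = [d for d in (self.lighting, self.composition, self.technical) if d]
        return " ".join(s.rstrip(".") + "." for s in [opening] + details)


portrait = ImagePrompt(
    style="A photorealistic portrait",
    subject="an elderly Japanese ceramicist",
    setting="against a neutral background",
    lighting="Natural window light from the left",
    composition="Shallow depth of field",
)
print(portrait.render())
```

Empty slots are skipped, so a blank lighting field is easy to notice when you review the object before rendering.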

The Big Shift: Sentences Over Keywords

Older approach (doesn't work as well anymore):

woman, red dress, cafe, morning, coffee, vintage, 4k, award winning

Modern approach (works much better):

A young woman in a flowing crimson dress sits at a Parisian sidewalk cafe,
her fingers wrapped around a steaming espresso cup as golden morning light
filters through the awning, creating soft shadows on the vintage iron table.

Modern models - especially Midjourney V6+, Flux, and Nano Banana - understand descriptive sentences much better than keyword lists.

Platform-Specific Examples

DALL-E 3 / GPT-4o Image

A high-resolution, studio-lit product photograph of a minimalist ceramic
coffee mug in matte black, presented on a polished concrete surface.
Soft diffused lighting from above, subtle shadow, clean background.
Square image.

Midjourney V7

Haute-couture advertising campaign photographed by Erik Madigan Heck.
Two models wearing Comme des Garcons Avant-Garde costume.
Mongol steppe in background. Northern lights in sky --ar 2:3 --v 7

Midjourney has special parameters:

--ar 2:3 - aspect ratio
--v 7 - model version
--cref [URL] - character reference (a V6 feature; V7 replaces it with --oref, omni reference)
--sref [URL] - style reference
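
Since these parameters always trail the description, appending them is easy to script. Here's a small Python sketch, assuming a hypothetical helper named with_midjourney_params. It only builds the prompt string using flags mentioned in this post (--ar, --v, --no) and does not talk to Midjourney in any way:

```python
def with_midjourney_params(description, aspect_ratio=None, version=None, exclude=None):
    """Append Midjourney-style flags to a description (hypothetical helper)."""
    flags = []
    if aspect_ratio:
        flags.append(f"--ar {aspect_ratio}")
    if version:
        flags.append(f"--v {version}")
    if exclude:
        # --no takes a comma-separated list of things to leave out.
        flags.append("--no " + ", ".join(exclude))
    return " ".join([description.strip()] + flags)


mj_prompt = with_midjourney_params(
    "Mongol steppe in background. Northern lights in sky",
    aspect_ratio="2:3",
    version=7,
)
print(mj_prompt)
```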

Stable Diffusion 3.5

Positive: majestic lion with golden mane, hyperrealistic, 8K, detailed fur
Negative: blurry, low quality, distorted, bad anatomy, extra fingers

SD's superpower: full negative prompts, plus prompt weighting with the (keyword:1.5) syntax supported by front ends such as AUTOMATIC1111 and ComfyUI.
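
To make the weighting syntax concrete, here is a short Python sketch that builds a positive and negative prompt pair. The sd_prompt function and the "positive" and "negative" dictionary keys are assumptions for illustration. The real field names depend on the front end or library you send the prompt to:

```python
def sd_prompt(positive_terms, negative_terms, weights=None):
    """Build a positive/negative prompt pair (hypothetical helper).

    Terms listed in `weights` are wrapped as (term:weight); the rest pass through.
    """
    weights = weights or {}
    positive = ", ".join(
        f"({term}:{weights[term]})" if term in weights else term
        for term in positive_terms
    )
    return {"positive": positive, "negative": ", ".join(negative_terms)}


lion = sd_prompt(
    ["majestic lion with golden mane", "hyperrealistic", "detailed fur"],
    ["blurry", "low quality", "distorted"],
    weights={"detailed fur": 1.5},  # emphasize the fur over the other terms
)
print(lion["positive"])
print(lion["negative"])
```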

Flux

A hyperrealistic portrait of a weathered sailor in his 60s,
with deep-set blue eyes, a salt-and-pepper beard, and sun-weathered skin.
He's wearing a faded blue captain's hat and a thick wool sweater.
The background shows a misty harbor at dawn.

Flux uses two text encoders (T5 and CLIP) and is among the strongest models for rendering text inside images.

Writing Video Prompts

Basic Structure

[CAMERA/SHOT] + [SUBJECT] + [ACTION] + [ENVIRONMENT] + [STYLE] + [AUDIO]

Components

Subject - who/what is in focus
Context - where the action happens
Action - what the subject does
Style - visual aesthetic
Camera - shot type and movement
Lighting - mood and atmosphere
Audio - sound effects, dialogue (for Sora 2, Veo 3)
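
The components can be written out as labeled lines, which keeps them from blurring into each other. Here's a minimal Python sketch, assuming a hypothetical build_video_prompt helper. The section names and their order are my own choice, not a format that any video model requires:

```python
SECTION_ORDER = ["camera", "subject", "action", "setting",
                 "style", "lighting", "audio", "dialogue"]


def build_video_prompt(**sections):
    """Render labeled prompt lines in a fixed order (hypothetical helper)."""
    unknown = set(sections) - set(SECTION_ORDER)
    if unknown:
        raise ValueError(f"Unknown sections: {sorted(unknown)}")
    # Skip sections that were not provided or are empty.
    lines = [f"{name.capitalize()}: {sections[name]}"
             for name in SECTION_ORDER if sections.get(name)]
    return "\n".join(lines)


clip = build_video_prompt(
    subject="A seasoned grey-bearded man in sunglasses and paisley shirt",
    camera="Medium shot, slow push-in",
    setting="Vibrant mural wall background",
    audio="Faint city murmurs, distant chatter, mellow soulful hip-hop beat",
)
print(clip)
```

Keyword arguments can be passed in any order. The helper always emits camera first and audio near the end.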

Platform Examples

Sora 2 (OpenAI)

Style: Hand-painted 2D/3D hybrid animation with soft brush textures.

Inside a cluttered workshop, a small round robot sits on a wooden bench.

Cinematography:
Camera: medium close-up, slow push-in with gentle parallax
Lens: 35mm virtual lens; shallow depth of field
Lighting: warm key from overhead; cool spill from window

Actions:
- The robot taps the bulb; sparks crackle.
- It flinches, dropping the bulb.
- Robot says: "Almost lost it... but I got it!"

What I learned about Sora:

Short clips (4 sec) are more stable than long ones
One camera move per shot
Dialogue goes in a separate block

Veo 3 (Google)

Camera: Medium shot, slow push-in
Subject: A seasoned grey-bearded man in sunglasses and paisley shirt
Setting: Vibrant mural wall background
Audio: Faint city murmurs, distant chatter, mellow soulful hip-hop beat
Dialogue: [Character says: "This is the moment..."]

Veo 3 generates audio natively - describe sounds in separate sentences.

Kling 2.6

A static shot of a burger as it assembles in mid-air.
The entire shot is in dramatic slow-motion.
Background is a clean professional studio gradient.
Style: TV food commercial
++burger assembling in mid-air++

Kling uses ++keyword++ to emphasize important elements.

Handling "Don't Do This" Instructions

LLMs: Constraints in Structure

<constraints>
- Do not include personal opinions
- Do not exceed 500 words
- Do not use technical jargon
</constraints>

Images: Negative Prompts

Stable Diffusion:

Negative: blurry, low quality, distorted, bad anatomy, extra fingers,
watermark, text, signature

Midjourney:

--no text, watermark, blurry background

Semantic negatives (Nano Banana, GPT-4o Image):

No extra fingers or hands; no text except the title;
avoid watermarks; avoid clutter; no background distractions.

Video: Exclusions

Avoid Dutch angles; no on-screen text; no lens flare;
no subtitle overlays; no watermarks.
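
The same list of unwanted elements ends up in three different shapes depending on the tool. Here's a Python sketch of that translation, assuming a hypothetical render_exclusions helper. The platform labels ("stable_diffusion", "midjourney", "semantic") are made up for this example:

```python
def render_exclusions(items, platform):
    """Format one exclusion list in a platform's style (hypothetical helper)."""
    if platform == "stable_diffusion":
        # Separate negative prompt: a plain comma-separated list.
        return "Negative: " + ", ".join(items)
    if platform == "midjourney":
        # Inline parameter appended to the main prompt.
        return "--no " + ", ".join(items)
    if platform == "semantic":
        # Plain-language exclusions written into the prompt itself.
        text = "; ".join(f"no {item}" for item in items) + "."
        return text[0].upper() + text[1:]
    raise ValueError(f"Unknown platform: {platform}")


unwanted = ["text", "watermark", "blurry background"]
for name in ("stable_diffusion", "midjourney", "semantic"):
    print(render_exclusions(unwanted, name))
```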

When to Use What

Use LLMs for:

Text analysis and processing
Content generation (articles, posts, emails)
Programming and code review
Q&A and research
Document summarization
Translation and localization

Use Image Generation for:

Marketing visuals
Concept art and illustrations
Product mockups
Social media content
Stickers and icons
Infographics

Use Video Generation for:

Short promo clips
Product videos
Social media content
B-roll footage
Animated concepts
Music visualizations

Platform Comparison Tables

Image Generation

Platform	Text in Image	Prompt Adherence	Negative Prompts	Best For
DALL-E 3	Okay	Good	None	General tasks
Midjourney V7	Okay	Good	`--no`	Artistic quality
Stable Diffusion 3.5	Good	Good	Full support	Customization
Flux	Excellent	Excellent	Limited	Text, realism
Ideogram	Excellent	Good	Limited	Typography
GPT-4o Image	Excellent	Good	Semantic	Conversational editing
Nano Banana	Good	Good	Semantic	Speed, editing

Video Generation

Platform	Duration	Audio	Physics	Best For
Sora 2	10-20 sec	Excellent	Excellent	Complex scenes
Veo 3.1	4-8 sec	Excellent	Good	Native audio
Runway Gen-4	10 sec	Okay	Okay	Image-to-video
Kling 2.6	5-10 sec	Good	Good	Lip-sync

Tips for Both

For LLMs

Structure your prompt with XML or Markdown
Set a role for expertise and style
Be explicit, especially for Claude
Use few-shot examples for complex tasks
Iterate through dialogue

For Generative Models

Describe, don't list keywords
Always specify lighting - it dramatically affects results
Use reference images when available (for example, Midjourney's --sref and --cref parameters)
Add negative prompts to exclude unwanted elements
Experiment with variations - each generation is unique

Universal Principles

Specificity matters everywhere - precise descriptions get better results
Know your platform's quirks - each model is different
Iterate and improve - first prompt is rarely perfect
Study examples - see what works for others

The Takeaway

LLM prompts and image/video prompts are fundamentally different:

LLMs want structured instructions with roles, constraints, and format
Images want descriptive sentences with visual attributes
Video wants cinematography terminology plus audio components

Understanding this difference gives you significantly better results from each type of model.
