The fundamental differences between prompts for LLMs and generative AI for images and video.
Here's something that tripped me up when I started generating images. I thought writing prompts was writing prompts. Same skill, different tool. Nope.
Prompts for ChatGPT and prompts for Midjourney require completely different thinking. One wants structure and instructions. The other wants descriptions and vibes. Once I understood this, my results got way better.
Let me break it down.
LLMs are trained on text to generate, analyze, and process text.
Examples: ChatGPT, Claude, Gemini, Llama, Mistral
Typical tasks: writing, analysis, code, Q&A, summarization
Image and video models create visual content from text descriptions.
For images: DALL-E 3, Midjourney V7, Stable Diffusion 3.5, Flux, Ideogram, GPT-4o Image, Nano Banana
For video: Sora 2, Veo 3, Runway Gen-4, Kling 2.6, Pika, Luma
| Element | Purpose |
|---|---|
| Role | Sets expertise and style |
| Instructions | Step-by-step what to do |
| Context | Background info and data |
| Constraints | What NOT to do |
| Output format | Structure of the response |
LLM prompt example:
<role>You are a senior marketing analyst</role>
<instructions>
Analyze the campaign data and provide:
1. Key metrics summary
2. Performance trends
3. Recommendations
</instructions>
<constraints>
- Use only provided data
- Be concise (max 500 words)
</constraints>
<data>{{CAMPAIGN_DATA}}</data>
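If you reuse this layout a lot, it helps to build it in code. Here's a minimal Python sketch of how I'd assemble the tagged prompt above. The function name `build_llm_prompt` and its parameters are my own invention for illustration, not part of any library.

```python
def build_llm_prompt(role, task, steps, constraints, data):
    """Assemble an XML-tagged prompt from its parts (hypothetical helper)."""
    # Number the requested outputs: "1. ...", "2. ...", and so on.
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    # Render each constraint as a "- ..." bullet.
    rules = "\n".join(f"- {rule}" for rule in constraints)
    return (
        f"<role>{role}</role>\n"
        f"<instructions>\n{task}\n{numbered}\n</instructions>\n"
        f"<constraints>\n{rules}\n</constraints>\n"
        f"<data>{data}</data>"
    )


prompt = build_llm_prompt(
    role="You are a senior marketing analyst",
    task="Analyze the campaign data and provide:",
    steps=["Key metrics summary", "Performance trends", "Recommendations"],
    constraints=["Use only provided data", "Be concise (max 500 words)"],
    data="{{CAMPAIGN_DATA}}",  # placeholder, swap in the real data before sending
)
print(prompt)
```

Keeping each element as its own argument makes it easy to change the role or the constraints without touching the rest of the prompt.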
| Element | Purpose |
|---|---|
| Style | Artistic aesthetic |
| Subject | Main object/character |
| Setting | Environment and context |
| Lighting | Light sources and mood |
| Composition | Angle and framing |
| Technical | Resolution, aspect ratio |
Image prompt example:
A photorealistic portrait of an elderly Japanese ceramicist
with deep, sun-etched wrinkles and a warm, knowing smile.
Natural window light from the left, shallow depth of field,
neutral background. Serene and masterful mood.
See the difference? One is giving orders. The other is painting a picture with words.
| Aspect | LLMs | Image/Video Models |
|---|---|---|
| Format | Structured (XML/Markdown) | Descriptive text |
| Keywords vs sentences | Full sentences | Depends on model* |
| Negative instructions | <constraints> tags | Negative prompts |
| Iteration | Dialogue and refinement | Rerolls and variations |
| Examples | Text examples | Reference images |
| Length control | Specified in instructions | Not applicable |
| Style control | Tone and format | Artistic aesthetic |
*Modern models like Midjourney V6+, Flux, Nano Banana prefer full descriptive sentences over keyword lists.
[Style/Aesthetic] + [Subject] + [Setting] + [Lighting] + [Composition] + [Technical]
Older approach (doesn't work as well anymore):
woman, red dress, cafe, morning, coffee, vintage, 4k, award winning
Modern approach (works much better):
A young woman in a flowing crimson dress sits at a Parisian sidewalk cafe,
her fingers wrapped around a steaming espresso cup as golden morning light
filters through the awning, creating soft shadows on the vintage iron table.
Modern models - especially Midjourney V6+, Flux, and Nano Banana - understand descriptive sentences much better than keyword lists.
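To keep myself honest about the order of elements in the formula, I sometimes fill the slots in code. This is only a sketch: the `ImagePrompt` class and its field names are hypothetical, and each slot is expected to hold a descriptive phrase or sentence you wrote yourself, not a keyword list.

```python
from dataclasses import dataclass


@dataclass
class ImagePrompt:
    """One field per slot of the formula (hypothetical helper)."""
    style: str = ""
    subject: str = ""
    setting: str = ""
    lighting: str = ""
    composition: str = ""
    technical: str = ""

    def render(self) -> str:
        # Keep the formula order: style, subject, setting, lighting,
        # composition, technical. Empty slots are skipped.
        parts = [self.style, self.subject, self.setting,
                 self.lighting, self.composition, self.technical]
        # Normalize each filled slot to end with exactly one period.
        sentences = [p.strip().rstrip(".") + "." for p in parts if p.strip()]
        return " ".join(sentences)


cafe = ImagePrompt(
    style="Vintage-toned photograph",
    subject="A young woman in a flowing crimson dress sits at a Parisian sidewalk cafe",
    lighting="Golden morning light filters through the awning",
)
print(cafe.render())
```

The helper does not write the description for you. It only enforces the order and drops the slots you left empty.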
A high-resolution, studio-lit product photograph of a minimalist ceramic
coffee mug in matte black, presented on a polished concrete surface.
Soft diffused lighting from above, subtle shadow, clean background.
Square image.
Haute-couture advertising campaign photographed by Erik Madigan Heck.
Two models wearing Comme des Garcons Avant-Garde costume.
Mongol steppe in background. Northern lights in sky --ar 2:3 --v 7
Midjourney has special parameters:
- --ar 2:3 - aspect ratio
- --v 7 - model version
- --cref [URL] - character reference
- --sref [URL] - style reference

Positive: majestic lion with golden mane, hyperrealistic, 8K, detailed fur
Negative: blurry, low quality, distorted, bad anatomy, extra fingers
SD's superpower: full negative prompts and prompt weighting with (keyword:1.5).
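Here's a small Python sketch of how the `(keyword:1.5)` weighting and the positive/negative split could be generated. The helper names and the `prompt` / `negative_prompt` dictionary keys are my assumptions for illustration, so check what your own Stable Diffusion front end expects.

```python
def weight(keyword: str, strength: float = 1.0) -> str:
    """Wrap a keyword as (keyword:strength); leave it bare at the default 1.0."""
    if strength == 1.0:
        return keyword
    # The :g format drops trailing zeros, so 1.50 is written as 1.5.
    return f"({keyword}:{strength:g})"


def sd_prompt(positive, negative):
    """Build a positive/negative prompt pair.

    positive: list of (keyword, strength) tuples
    negative: list of plain keywords
    The dictionary keys are hypothetical names, not a specific tool's API.
    """
    return {
        "prompt": ", ".join(weight(k, s) for k, s in positive),
        "negative_prompt": ", ".join(negative),
    }


lion = sd_prompt(
    positive=[
        ("majestic lion with golden mane", 1.0),
        ("hyperrealistic", 1.0),
        ("detailed fur", 1.5),  # emphasized
    ],
    negative=["blurry", "low quality", "distorted", "bad anatomy", "extra fingers"],
)
print(lion["prompt"])
print(lion["negative_prompt"])
```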
A hyperrealistic portrait of a weathered sailor in his 60s,
with deep-set blue eyes, a salt-and-pepper beard, and sun-weathered skin.
He's wearing a faded blue captain's hat and a thick wool sweater.
The background shows a misty harbor at dawn.
Flux uses a dual text encoder (T5 + CLIP) and is among the strongest models for rendering text inside images.
[CAMERA/SHOT] + [SUBJECT] + [ACTION] + [ENVIRONMENT] + [STYLE] + [AUDIO]
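The same slot idea works for video. Below is a rough Python sketch that renders the formula as labeled lines. The function `build_video_prompt` and the slot names are hypothetical, and the "Label: text" layout is just one way to write it, not a format any model requires.

```python
# Slot order follows the formula: camera, subject, action, environment, style, audio.
VIDEO_SLOTS = ["camera", "subject", "action", "environment", "style", "audio"]


def build_video_prompt(**slots: str) -> str:
    """Render filled slots as 'Label: text' lines in formula order (hypothetical helper)."""
    unknown = set(slots) - set(VIDEO_SLOTS)
    if unknown:
        raise ValueError(f"Unknown slots: {sorted(unknown)}")
    lines = [
        f"{name.capitalize()}: {slots[name]}"
        for name in VIDEO_SLOTS
        if slots.get(name)  # skip slots that were not provided or are empty
    ]
    return "\n".join(lines)


robot = build_video_prompt(
    camera="Medium close-up, slow push-in",
    subject="A small round robot",
    action="It taps a bulb and flinches as sparks crackle",
    environment="A cluttered workshop",
    style="Hand-painted animation with soft brush textures",
)
print(robot)
```

Arguments can be passed in any order. The output always follows the order of the formula.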
Style: Hand-painted 2D/3D hybrid animation with soft brush textures.
Inside a cluttered workshop, a small round robot sits on a wooden bench.
Cinematography:
Camera: medium close-up, slow push-in with gentle parallax
Lens: 35mm virtual lens; shallow depth of field
Lighting: warm key from overhead; cool spill from window
Actions:
- The robot taps the bulb; sparks crackle.
- It flinches, dropping the bulb.
- Robot says: "Almost lost it... but I got it!"
What I learned about Sora:
Camera: Medium shot, slow push-in
Subject: A seasoned grey-bearded man in sunglasses and paisley shirt
Setting: Vibrant mural wall background
Audio: Faint city murmurs, distant chatter, mellow soulful hip-hop beat
Dialogue: [Character says: "This is the moment..."]
Veo 3 generates audio natively - describe sounds in separate sentences.
A static shot of a burger as it assembles in mid-air.
The entire shot is in dramatic slow-motion.
Background is a clean professional studio gradient.
Style: TV food commercial
++sleek red convertible++
Kling uses ++keyword++ to emphasize important elements.
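If you want to add that emphasis without retyping the prompt, a few lines of Python will do it. This is a sketch that assumes the `++keyword++` syntax described above; the `emphasize` function is a hypothetical helper of mine.

```python
import re


def emphasize(prompt: str, phrases: list[str]) -> str:
    """Wrap the first occurrence of each phrase in ++...++ (hypothetical helper)."""
    for phrase in phrases:
        # The lookarounds skip occurrences that are already wrapped in ++...++.
        pattern = re.compile(rf"(?<!\+\+){re.escape(phrase)}(?!\+\+)")
        prompt = pattern.sub(lambda m: f"++{m.group(0)}++", prompt, count=1)
    return prompt


text = "A sleek red convertible drives along a coastal road at sunset."
marked = emphasize(text, ["sleek red convertible"])
print(marked)
```

Running it twice on the same text does not double-wrap a phrase that is already marked.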
<constraints>
- Do not include personal opinions
- Do not exceed 500 words
- Do not use technical jargon
</constraints>
Stable Diffusion:
Negative: blurry, low quality, distorted, bad anatomy, extra fingers,
watermark, text, signature
Midjourney:
--no text, watermark, blurry background
Semantic negatives (Nano Banana, GPT-4o Image):
No extra fingers or hands; no text except the title;
avoid watermarks; avoid clutter; no background distractions.
Avoid Dutch angles; no on-screen text; no lens flare;
no subtitle overlays; no watermarks.
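Since the same list of unwanted elements gets written three different ways, here's a minimal Python sketch that converts one list into each style shown above. The function name and the style labels (`stable_diffusion`, `midjourney`, `semantic`) are my own assumptions, not official identifiers.

```python
def format_negatives(items, style):
    """Format a list of unwanted elements for one negative-prompt style (hypothetical helper)."""
    if not items:
        raise ValueError("items must not be empty")
    if style == "stable_diffusion":
        # Separate negative prompt, comma-separated keywords.
        return "Negative: " + ", ".join(items)
    if style == "midjourney":
        # Single --no parameter followed by a comma-separated list.
        return "--no " + ", ".join(items)
    if style == "semantic":
        # Plain-language exclusions written into the prompt itself.
        sentence = "; ".join(f"no {item}" for item in items)
        return sentence[0].upper() + sentence[1:] + "."
    raise ValueError(f"Unknown style: {style}")


unwanted = ["text", "watermark", "blurry background"]
for style in ("stable_diffusion", "midjourney", "semantic"):
    print(format_negatives(unwanted, style))
```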
| Platform | Text in Image | Prompt Adherence | Negative Prompts | Best For |
|---|---|---|---|---|
| DALL-E 3 | Okay | Good | None | General tasks |
| Midjourney V7 | Okay | Good | --no | Artistic quality |
| Stable Diffusion 3.5 | Good | Good | Full support | Customization |
| Flux | Excellent | Excellent | Limited | Text, realism |
| Ideogram | Excellent | Good | Limited | Typography |
| GPT-4o Image | Excellent | Good | Semantic | Conversational editing |
| Nano Banana | Good | Good | Semantic | Speed, editing |
| Platform | Duration | Audio | Physics | Best For |
|---|---|---|---|---|
| Sora 2 | 10-20 sec | Excellent | Excellent | Complex scenes |
| Veo 3.1 | 4-8 sec | Excellent | Good | Native audio |
| Runway Gen-4 | 10 sec | Okay | Okay | Image-to-video |
| Kling 2.6 | 5-10 sec | Good | Good | Lip-sync |
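LLM prompts and image/video prompts are fundamentally different: LLMs want structure and instructions, while image and video models want a description of the scene.
Understanding this difference gives you significantly better results from each type of model.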