Prompt Tips · Feb 26, 2026 · 10 min

AI Image Prompt Formulas for Lighting, Style, and Composition (That Actually Hold Up)

A practical way to write image prompts like a cinematographer: lock composition early, specify lighting like a rig, and treat style as constraints, not vibes.


Most "bad" AI image prompts aren't bad because they're uncreative. They're bad because they're underspecified in the exact places the model can't infer reliably: lighting intent, compositional intent, and which parts of "style" are non-negotiable.

If you've ever typed something like "cinematic portrait, dramatic lighting" and got a random, glossy, center-framed image with lighting that feels like a video game cutscene, you've already seen the problem. The model didn't fail. You gave it permission to guess.

Here's my take: the best image prompts read less like poetry and more like a shot plan. And there's research backing why this works: diffusion systems tend to "lock in" composition early, then refine style and texture later. That means composition and spatial constraints have to be explicit and up front, or you'll fight the model every iteration. Tinaz et al. show this pretty directly: composition emerges extremely early in diffusion, while style interventions are more effective mid-process, and late steps mostly tweak texture details [2]. Different architecture, same lived experience: if you don't specify composition early, you end up "editing" the prompt forever.

Let's build a set of prompt formulas that reflect that reality.


The three-layer formula: Composition → Lighting → Style

I like prompts that stack constraints in this order:

  1. Composition (what's in frame, where it sits, and how the "camera" sees it)
  2. Lighting (the rig: direction, softness, contrast, color, and atmosphere)
  3. Style (the rendering contract: medium + references + grading + texture rules)

Why this order? Because composition is the hardest thing to "fix later." Once the model decides where subjects live in the frame, you can often nudge mood and finish, but re-blocking the scene is painful. That matches interpretability findings in diffusion: early steps strongly shape scene layout; later steps are about refinement [2].

And there's another "hidden" reason to be structured: prompt semantics can degrade through the denoising stack in some modern multimodal diffusion transformer setups. Yao et al. describe prompt forgetting: fine-grained prompt information becomes less recoverable in deeper layers, especially for spatial relations [1]. Translation: if your prompt is a soup of adjectives, some constraints simply won't survive.

So we'll write prompts like we expect constraints to get dropped unless they're clear.
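The three-layer stack is easy to make mechanical. Here's a minimal sketch of a prompt builder that keeps the ordering fixed: composition first, lighting second, style last. The class and field names are my own illustration, not any model's API.

```python
from dataclasses import dataclass

@dataclass
class ShotPrompt:
    """Three-layer prompt: composition, then lighting, then style."""
    composition: str
    lighting: str
    style: str
    constraints: str = ""

    def render(self) -> str:
        # Order matters: composition goes first so spatial constraints
        # aren't buried behind adjectives the model may drop.
        parts = [self.composition,
                 f"Lighting: {self.lighting}",
                 f"Style: {self.style}"]
        if self.constraints:
            parts.append(f"Constraints: {self.constraints}")
        return "\n".join(parts)

prompt = ShotPrompt(
    composition=("Portrait of a violinist, seated, in a rehearsal room. "
                 "Camera: medium shot, 85mm, eye level. "
                 "Composition: subject on left third, shallow depth of field, 4:5."),
    lighting="soft key from camera-left, low fill, subtle rim light",
    style="photoreal documentary photography, subtle film grain",
    constraints="no text, no watermark",
)
print(prompt.render())
```

The point isn't the code itself; it's that once the layers are fields, you can't accidentally write a style-first prompt.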


Composition formula: write like a camera operator

Composition is where you tell the model what you'd tell a photographer: lens, distance, angle, framing, and background behavior.

Use this structure:

[Subject + key visual traits], [action/pose], in [setting]. 
Camera: [shot type], [lens], [camera height/angle]. 
Composition: [subject placement], [foreground/midground/background], [depth of field], [aspect ratio].

A few things I've noticed help disproportionately:

First, treat lens choice as a constraint, not an aesthetic. "85mm portrait lens" is more actionable than "professional portrait." Communities have converged on the same trick: quantified camera parameters tend to steer better than vibes like "high-end." It's practical, not mystical [5].

Second, spatial language needs to be boring and explicit. "Subject on the left third, looking into negative space on the right" is far better than "dynamic composition." Spatial constraints are exactly the kind of detail that can get lost (or misinterpreted) in deeper layers [1].

Third, don't forget the background policy. Most generators love clutter. If you don't specify "minimal props" or "clean background," you'll get set dressing you never asked for.
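One way to force yourself into quantified, boring spatial language is to render the camera and composition lines from typed fields, so the lens is a number and the placement is an explicit phrase. A sketch, with field names that are purely illustrative:

```python
def composition_block(subject: str, shot: str, lens_mm: int, angle: str,
                      placement: str, dof: str, aspect: str) -> str:
    """Render the composition layer; lens_mm being an int keeps it a
    constraint ("50mm") rather than an adjective ("professional")."""
    return (
        f"{subject}.\n"
        f"Camera: {shot}, {lens_mm}mm lens equivalent, {angle}.\n"
        f"Composition: {placement}, {dof}, {aspect} aspect ratio."
    )

print(composition_block(
    subject="A potter shaping a bowl at a wheel, in a sunlit studio",
    shot="medium shot",
    lens_mm=50,
    angle="eye level",
    placement="subject on the left third, negative space on the right",
    dof="shallow depth of field",
    aspect="3:2",
))
```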


Lighting formula: describe a rig, not a mood

Lighting is the difference between "an image of a person" and "a shot."

Here's the lighting structure I use:

Lighting: [key light direction + quality], [fill level], [rim/back light], [shadow softness], 
[practical sources], [atmosphere], [color temperature / palette], [grade].

Examples of rig-level phrasing that tends to work:

"soft key from camera-left through diffusion"
"low fill, deep shadows, controlled contrast"
"subtle rim light outlining shoulders"
"volumetric haze catching highlights"

This matches what systems built for filmmakers expose as controls: PrevizWhiz, for example, explicitly separates style/genre from controllable adherence to color and lighting, and treats "how closely lighting and composition adhere" as a first-class knob [3]. That's a good mental model for prompting too: lighting isn't a vibe. It's a set of knobs.
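If lighting is a set of knobs, you can literally model it as one: a dict of named knobs rendered in a stable order, where unset knobs simply don't appear. The knob names below are my own, chosen to mirror the rig structure above; they're not any tool's parameters.

```python
def render_lighting(knobs: dict) -> str:
    """Join only the knobs you actually set, in rig order:
    key, fill, rim, atmosphere, palette."""
    order = ["key", "fill", "rim", "atmosphere", "palette"]
    return "Lighting: " + ", ".join(knobs[k] for k in order if k in knobs)

rig = {
    "key": "soft key from camera-left through diffusion",
    "fill": "low fill, deep shadows",
    "rim": "subtle rim light outlining shoulders",
    "atmosphere": "volumetric haze catching highlights",
}
print(render_lighting(rig))
```

The useful side effect: when an image looks wrong, you can name the knob you'll turn next instead of rewording a mood.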

One caution: lighting words can also smuggle in demographic and aesthetic defaults. If you're generating people, "photorealistic" plus defaults can still drift into biased outcomes depending on the model. A recent audit found "neutral" prompts like "a person, photorealistic" still produced overwhelmingly white defaults and model-dependent gender skews [4]. So if representation matters, specify it. Don't assume neutrality does anything.


Style formula: a contract, not a collage

"Style" is where most people overdo it. They pile on: cinematic, ultra-detailed, 8k, trending on ArtStation, award-winning… and then wonder why outputs feel random.

Instead, write style as a contract with two parts:

  1. Medium & references (what pipeline it should resemble)
  2. Material/finish rules (what realism cues must be present)

A clean style line looks like:

Style: [medium], [era/genre], [reference cues], [texture/film grain], [color grade].
Materials: [skin texture rule], [fabric rule], [surface rule].

Notice what's missing: vague praise words. "Beautiful," "stunning," "epic." Those don't constrain behavior.
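Those filler words are easy to lint for mechanically. Here's a tiny sketch of a pass that flags them in a style line; the word list is my own, assembled from the examples above, not an exhaustive taxonomy.

```python
# Praise words that describe your hopes, not the render pipeline.
FILLER = {"beautiful", "stunning", "epic", "award-winning", "8k", "masterpiece"}

def lint_style(style_line: str) -> list:
    """Return the filler words found, lowercased, so you can delete them."""
    tokens = style_line.lower().replace(",", " ").split()
    return [t for t in tokens if t in FILLER]

print(lint_style("Style: stunning 8k cinematic portrait, film grain"))
```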


Practical examples (with full prompts)

Below are a few prompts you can paste directly into your image model and iterate. The structure is inspired by community templates that emphasize "describe a shot, not a thing," with explicit lighting and camera blocks [5]. I'm keeping them original, but the underlying workflow is the same: block → light → style → negatives.

Example 1: cinematic product hero shot

Subject: a matte black wireless headphone set with subtle brushed-metal accents.
Action/Context: resting on a dark walnut desk, angled slightly toward camera.
Setting: minimalist studio tabletop, no clutter, seamless dark background.

Camera: medium close-up product shot, 85mm lens equivalent, eye-level with product.
Composition: headphones on the left third, large negative space on the right for copy, shallow depth of field, 4:5 aspect ratio.

Lighting: soft key from camera-left through a large diffuser, gentle fill from camera-right at low intensity, thin rim light from behind to separate edges, soft shadows, controlled specular highlights, no blown-out reflections.
Color: neutral-to-cool grade, deep blacks with preserved detail.

Style: high-end commercial photography, realistic materials, crisp micro-contrast, subtle film grain.
Constraints: no text, no watermark, no logos, no extra objects, no distortion, no melted geometry, no overbloom.

Example 2: portrait with specific lighting

Subject: a middle-aged woman with short curly hair and natural skin texture.
Pose: seated, shoulders relaxed, looking slightly past camera, calm expression.
Setting: quiet interior, plain textured wall background.

Camera: tight portrait, 90mm lens equivalent, chest-up framing, slight downward angle.
Composition: centered face with a little headroom, background softly blurred, 3:2 aspect ratio.

Lighting: window light from camera-right as soft key, negative fill on camera-left for sculpted shadows, subtle rim light to define hair, soft shadow edges, warm highlights and cooler shadows.
Color: gentle warm grade, natural tones, no neon saturation.

Style: photoreal portrait photography, documentary feel, realistic eyes and skin pores.
Constraints: no plastic skin, no cartoon look, no extra fingers, no distorted facial proportions, no text.

Example 3: scene composition you actually control

Subject: a cyclist in a yellow rain jacket.
Action: riding through a crosswalk, water splashing from the wheels.
Setting: rainy city street at dusk, distant storefronts softly glowing.

Camera: wide establishing shot, 24mm lens equivalent, low camera height near street level.
Composition: cyclist entering from bottom-left toward center, strong leading lines from crosswalk stripes, background buildings in soft focus, 16:9 aspect ratio.

Lighting: dusk ambient with wet reflections, practical storefront light as warm accents, soft top light from overcast sky, subtle volumetric mist, controlled contrast so details remain visible.
Color: teal-and-amber-ish grade but restrained, realistic reflections.

Style: cinematic live-action still frame, realistic rain physics, natural motion blur.
Constraints: no floating objects, no unreadable gibberish signs, no watermarks, no warped bike frame.

The iteration rule I follow: change one lever per run

If you change composition, lighting, and style all at once, you won't learn what caused the improvement. The fastest way to get consistent results is to treat your prompt like a configuration file: one edit, one reroll, compare.

This also aligns with what we see in diffusion behavior: early constraints set the layout; later changes tend to polish. So if the framing is wrong, don't "add more style." Fix camera/composition first. If the framing is right but the vibe is off, adjust lighting before you touch style.

And if you're building product workflows, consider a more formal loop where you iteratively optimize prompts based on preference feedback. There's active research on preference-guided prompt optimization for text-to-image, specifically aimed at reducing manual trial-and-error [6]. Even if you don't implement the algorithm, the mindset is useful: treat prompting as optimization, not inspiration.
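The one-lever rule is also mechanical: hold a baseline prompt fixed and generate variants that each differ in exactly one layer, so any change in output is attributable. A sketch, where the image-generation call itself is left as a placeholder:

```python
def one_lever_variants(baseline: dict, lever: str, options: list) -> list:
    """Each variant differs from the baseline in exactly one key."""
    return [{**baseline, lever: opt} for opt in options]

baseline = {
    "composition": "wide shot, 24mm, cyclist entering bottom-left, 16:9",
    "lighting": "dusk ambient, warm practicals, soft top light",
    "style": "cinematic live-action still frame",
}

for variant in one_lever_variants(baseline, "lighting",
                                  ["hard noon sun, crisp shadows",
                                   "overcast, flat soft light"]):
    # Feed each variant to your image model and compare against baseline.
    print(variant)
```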


Closing thought: prompts are production notes

The best mental model I've found is simple: you're not writing a prompt. You're writing production notes for a scene. Composition is blocking. Lighting is the rig. Style is post.

Do that, and the model stops guessing.


References

Documentation & Research

  1. Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers - arXiv - http://arxiv.org/abs/2602.06886v1
  2. Emergence and Evolution of Interpretable Concepts in Diffusion Models - arXiv - https://arxiv.org/abs/2504.15473
  3. PrevizWhiz: Combining Rough 3D Scenes and 2D Video to Guide Generative Video Previsualization - arXiv - http://arxiv.org/abs/2602.03838v1
  4. Neutral Prompts, Non-Neutral People: Quantifying Gender and Skin-Tone Bias in Gemini Flash 2.5 Image and GPT Image 1.5 - arXiv - https://arxiv.org/abs/2602.12133
  6. Preference-Guided Prompt Optimization for Text-to-Image Generation - arXiv - http://arxiv.org/abs/2602.13131v1

Community Examples [5]

  1. Here is the ChatGPT image prompt template you can use to make your AI Images look awesome - r/ChatGPTPromptGenius - https://www.reddit.com/r/ChatGPTPromptGenius/comments/1qms4bf/here_is_the_chatgpt_image_prompt_template_you_can/
  2. After analyzing 1,000+ viral prompts, I made a system prompt that auto-generates pro-level NanoBanana prompts - r/PromptEngineering - https://www.reddit.com/r/PromptEngineering/comments/1qq4tet/after_analyzing_1000_viral_prompts_i_made_a/
Ilia Ilinskii

Founder of Rephrase-it. Building tools to help humans communicate with AI.
