Learn how to prompt Veo 3.1, Kling 3.0, Runway Gen-4.5, and Sora 2 with the right mental model for each. See examples inside.
Most people treat video prompting like text prompting with prettier output. That's the mistake. Four top video models can all make a clip from one sentence, but they do not "think" about your request in the same way.
A mental model is the framing you use before writing the prompt, and it matters because video systems do not all fail in the same way. Research on video models shows a recurring gap between visual plausibility and instruction fidelity, plus common failure modes like instruction neglect, scene drift, and reward hacking [1][2].
Here's my practical read: if you use the same prompt skeleton everywhere, you're forcing the model to translate your intent instead of helping it execute. That extra translation step is where weird motion, ignored constraints, and muddy edits show up.
| Model | Best mental model | What to emphasize | Common failure risk |
|---|---|---|---|
| Veo 3.1 | World simulator | Physics, trajectory, continuity, timing | Constraint neglect, brittle multi-step execution |
| Kling 3.0 | Video editor | Specific edits, attributes, effects, locality | Over-editing or weak text following |
| Runway Gen-4.5 | Shot director | Shot intent, balance, style, camera language | Prompt variance across tasks |
| Sora 2 | Cinematic scene engine | Worldbuilding, sequencing, visual logic | Surreal drift, vague action execution |
Veo 3.1 works best when you describe the video as a physically unfolding scene with explicit motion rules. In research settings, results from Veo-like models improved when prompts specified viewpoint, speed, timing, stop conditions, and environmental consistency instead of just describing the subject [1][3].
I think of Veo as a trajectory model. Don't just say what should happen. Say how the camera or subject moves through space.
Bad prompt:
A drone flies around a tree in a cinematic way.
Better Veo-style prompt:
First-person aerial view. The camera performs a smooth 360-degree orbit around a single green tree at constant height and speed. Keep the tree centered in frame. No vertical bobbing, no sudden yaw changes, no camera shake. The environment remains perfectly consistent from start to end.
Here's what I noticed from the literature: detailed prompts help because video models often fake success visually. NavDreamer documents cases where models simulate progress with zoom or hallucinated objects instead of real motion [1]. So with Veo, add guardrails like "no zoom," "fixed height," "last frame returns to start," or "background remains unchanged."
| Before | After |
|---|---|
| "Move fast to the tree." | "In the first 3 seconds, the camera accelerates smoothly toward the tree with forward motion only, no zoom. At exactly 3 seconds, stop one meter in front of the tree and remain fully stationary for the next 2 seconds." |
Kling 3.0 works best when you prompt it like a high-end editing system, not a blank-sheet generator. Benchmark results on video editing show Kling 3.0 performs especially well across quantity, attribute, instance, and visual-effect editing, which tells me it likes targeted transformation language more than vague cinematic prose [2].
So I treat Kling as an instruction-following editor. The best prompts specify the original scene, the exact change, and what must stay untouched.
Instead of writing like a filmmaker, write like a post-production lead:
Keep the original camera angle and subject framing. Replace the rainy city background with a neon cyberpunk street at night. Preserve the person's face, pose, clothing silhouette, and hand motion. Add purple and cyan reflections on the jacket only. Do not alter the foreground proportions or facial features.
That "do not alter" line matters. VEFX-Bench separates instruction following, rendering quality, and edit exclusivity, and that's a useful way to think about Kling prompts too [2]. If you want a localized edit, say so explicitly.
Use this order: source scene, target change, preserved elements, style/effect, exclusions.
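That five-part order is mechanical enough to template. Below is a hypothetical sketch; `KlingEdit` and `render` are my own names, not part of any Kling SDK.

```python
from dataclasses import dataclass

# Hypothetical container for the five-part Kling edit order.
# KlingEdit is an illustrative structure, not a Kling SDK type.

@dataclass
class KlingEdit:
    source_scene: str   # what the original footage shows and keeps
    target_change: str  # the one edit you want
    preserved: str      # elements that must stay untouched
    style_effect: str   # look of the new material
    exclusions: str     # explicit "do not alter" boundary

    def render(self) -> str:
        # Emit the fields in the recommended order.
        return " ".join([
            self.source_scene,
            self.target_change,
            f"Preserve {self.preserved}.",
            self.style_effect,
            f"Do not alter {self.exclusions}.",
        ])

edit = KlingEdit(
    source_scene="Keep the original camera angle and subject framing.",
    target_change="Replace the rainy city background with a neon cyberpunk street at night.",
    preserved="the person's face, pose, clothing silhouette, and hand motion",
    style_effect="Add purple and cyan reflections on the jacket only.",
    exclusions="the foreground proportions or facial features",
)
print(edit.render())
```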
If you want more examples in this style, the Rephrase blog is a good place to browse prompt breakdowns for different AI workflows.
Runway Gen-4.5 works best when you think in terms of shot design. Benchmarks suggest it's relatively balanced across editing tasks, even if not always the absolute leader in every category [2]. That balance is a clue: Runway tends to reward prompts that define the shot clearly without over-constraining every pixel.
I think of Runway as a creative director with a storyboard brain. Give it shot purpose, visual style, pacing, and a few must-have constraints. Don't overload it with microscopic instructions unless the shot truly needs them.
A strong Runway prompt sounds like this:
Medium close-up of a chef plating tacos in a warm, cinematic kitchen. Slow handheld push-in over 4 seconds. Steam rising naturally from the food. Shallow depth of field, soft practical lighting, realistic hand motion, no text overlays, no extra utensils appearing.
What works well here is the balance. You're defining framing, motion, atmosphere, and failure boundaries. You're not narrating every frame.
| Before | After |
|---|---|
| "A taco ad in a cool style." | "30-degree angled product shot of three tacos on a wooden counter. Slow dolly-in, warm commercial lighting, crisp texture detail, subtle steam, shallow depth of field, high-end food ad aesthetic, no surreal ingredients, no plate movement." |
That's the difference between "make something nice" and "shoot this shot."
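One way to hold yourself to "a few must-have constraints" is to encode the shot as data and cap the exclusion list. This is a toy sketch under my own assumptions; `shot_card` and the cap are heuristics, not anything Runway exposes.

```python
# Toy "shot card" for Runway-style prompting: define the shot, then cap
# the exclusion list so you don't over-constrain. My heuristic, not Runway's.

MAX_EXCLUSIONS = 4  # assumed ceiling for "a few must-haves"

def shot_card(framing, motion, atmosphere, exclusions):
    if len(exclusions) > MAX_EXCLUSIONS:
        raise ValueError("Over-constrained: trim to the must-haves.")
    boundaries = "No " + ", no ".join(exclusions) + "."
    return " ".join([framing, motion, atmosphere, boundaries])

print(shot_card(
    framing="Medium close-up of a chef plating tacos in a warm, cinematic kitchen.",
    motion="Slow handheld push-in over 4 seconds.",
    atmosphere="Steam rising naturally, shallow depth of field, soft practical lighting.",
    exclusions=["text overlays", "extra utensils appearing"],
))
```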
Sora 2 works best when you combine cinematic worldbuilding with explicit sequencing. Research comparing frontier multimodal and video models suggests Sora-class systems can produce strong, creative visuals, but they still struggle when prompts rely on implicit logic or under-specified motion [3].
I treat Sora as a scene engine. It wants a vivid setup, but it still needs structure. The trick is to give it atmosphere without sacrificing action clarity.
A better Sora prompt usually has four parts: scene, subject, action sequence, and camera behavior.
A quiet convenience store at 2 a.m., lit by flickering fluorescent lights and a humming drink cooler. A tired cashier looks up as a black cat jumps onto the counter. The cat walks across the register, pauses, then knocks a receipt roll to the floor. Static wide shot for 2 seconds, then a slow push-in as the cashier reacts. Maintain realistic object positions and consistent store layout.
That last sentence is not fluff. MentisOculi found that even advanced multimodal systems can generate convincing visuals while failing to maintain usable internal consistency across steps [3]. In plain English: make the sequence explicit.
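The four-part structure is easy to enforce with a template. Here's a small sketch; `sora_prompt` and the numbered-beat format are my own convention, not a Sora requirement.

```python
# Sketch of the four-part Sora structure: scene, subject, action sequence,
# camera behavior. sora_prompt and the numbered beats are my own convention.

def sora_prompt(scene, subject, actions, camera):
    # Numbering each beat makes the sequence explicit, which counters the
    # cross-step consistency failures MentisOculi describes [3].
    beats = " ".join(f"({i}) {a}" for i, a in enumerate(actions, 1))
    closing = "Maintain realistic object positions and a consistent layout."
    return " ".join([scene, subject, beats, camera, closing])

print(sora_prompt(
    scene="A quiet convenience store at 2 a.m., lit by flickering fluorescent lights.",
    subject="A tired cashier looks up as a black cat jumps onto the counter.",
    actions=[
        "The cat walks across the register.",
        "It pauses.",
        "It knocks a receipt roll to the floor.",
    ],
    camera="Static wide shot for 2 seconds, then a slow push-in as the cashier reacts.",
))
```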
Different video models optimize for different strengths, and benchmarks keep showing the same pattern: nice-looking output is easier than faithful output [1][2]. If you don't adapt your prompt to the model's bias, you get clips that look impressive for three seconds and fall apart on review.
My rule is simple. Use Veo for physical choreography. Use Kling for targeted transformations. Use Runway for shot-driven creative work. Use Sora for cinematic scene construction with clear sequence control.
That also explains why generic "write me a video prompt" tools often disappoint. The useful ones don't just rewrite grammar. They reframe intent for the model. That's exactly where Rephrase is handy on macOS: you hit the hotkey in any app, and it rewrites your rough idea into a prompt style that better matches the task.
The fastest way to improve is to write one concept four ways. Start with the same scene, then rewrite it as trajectory, edit, shot, and cinematic sequence. You'll immediately feel which models want control language and which want visual staging.
That exercise is more valuable than memorizing "best prompt formulas," because it teaches you to think like the model before you type.
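Here's what the exercise looks like as a script: one base scene rendered four ways. The reframing templates are mine, not official guidance from any of the four vendors.

```python
# One scene, four framings: trajectory (Veo), edit (Kling), shot (Runway),
# sequence (Sora). The templates below are illustrative, not official.

SCENE = "a barista pouring latte art in a sunlit cafe"

REWRITES = {
    "Veo (trajectory)": (f"Fixed camera. {SCENE}; the pour follows one smooth "
                         "arc over 3 seconds. No cuts, no zoom, no camera shake."),
    "Kling (edit)": (f"Source: {SCENE}. Change only the cup color to matte black. "
                     "Preserve the hands, pour motion, and lighting. Do not alter the face."),
    "Runway (shot)": (f"Medium close-up, slow push-in: {SCENE}. Warm light, "
                      "shallow depth of field, no text overlays."),
    "Sora (sequence)": (f"Sunlit cafe, morning. {SCENE}: (1) tilt the pitcher, "
                        "(2) draw the rosetta, (3) set the cup down. Static shot, then a slow push-in."),
}

for style, prompt in REWRITES.items():
    print(f"{style}:\n  {prompt}\n")
```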
Documentation & Research
Community Examples 4. Are these video model generators THAT different? - r/PromptEngineering (link)
FAQ

Are these video models really that different?
Yes. Even when two models accept the same plain-English input, they often respond best to different kinds of structure. Some follow cinematic direction well, while others respond better to explicit motion, editing, or effect language.

Why do video models need such explicit prompts?
Because video models often optimize for visual plausibility before precise instruction following. Research shows they can drift, hallucinate, or ignore constraints unless you specify motion, timing, framing, and scene consistency clearly.