Learn how to use cinematography vocabulary in AI video prompts for Veo, Sora, and Kling so your shots land cleaner and feel more intentional. Try free.
Most AI video prompts fail for a boring reason: we ask for "cinematic," but we never say what the camera should actually do.
Cinematography vocabulary improves AI video prompts because it converts fuzzy creative intent into observable, time-based instructions. Recent research shows that models struggle with implicit camera understanding and respond better when motion, framing, and shot structure are described with explicit primitives and consistent terminology.[1][2]
Here's the core idea: film language is a compression format for visual intent. Directors don't tell a cinematographer, "make it cool." They say, "start wide, dolly in, hold eye level, then rack focus to the foreground." That language exists because it removes ambiguity. AI video models need the same thing.
This is not just taste. In *Building a Precise Video Language with Human-AI Oversight*, researchers argue that weak terminology leads to missing information, inconsistent captions, and misuse of terms like bird's-eye view, close-up, and zoom.[1] In *Geometry-Guided Camera Motion Understanding in VideoLLMs*, the authors show that models are notably weak at fine-grained camera motion recognition and improve when explicit motion labels are injected into prompts as a structured header.[2]
So if you're prompting Veo, Sora, or Kling, the win is simple: stop writing vibes first. Write camera behavior first.
A useful director's glossary for AI prompting should cover the controllable parts of a shot: shot size, angle, motion, framing, lens, focus, and pacing. These categories map better to how video models interpret scenes over time than broad adjectives like cinematic, dramatic, or beautiful.[1][3]
I like to think in seven buckets.
Shot size tells the model how close we are: extreme wide shot, wide shot, medium shot, medium close-up, close-up. This matters because a "close-up of trembling hands" is a totally different generation problem from "wide shot in a stormy parking lot."
Angle defines viewpoint: eye level, low angle, high angle, overhead, Dutch angle. This changes power, geometry, and subject emphasis.
Camera movement is where a lot of prompts break. Pan, tilt, truck, dolly, crane, roll, static. These are not interchangeable. Research-backed taxonomies treat them as distinct motion primitives for a reason.[2][3]
Lens and depth cover wide lens, telephoto, shallow depth of field, deep focus, fisheye distortion. These terms affect perceived space, not just style.
Focus behavior matters more than most people realize: rack focus, focus pull, foreground sharp / background soft. Recent work on precise video language calls out focus changes as details many datasets miss.[1]
Framing and screen position help stabilize composition: centered subject, off-center left, symmetrical framing, silhouette in background, subject in lower third. VERTIGO also evaluates prompts using composition dimensions like shot scale, shot angle, and screen position, which is a strong clue that these categories matter for control.[3]
Time structure helps the model stage the shot: opens with, midway through, then, finally. AI video often gets better when your prompt has sequence logic instead of one big blob.
A cinematography-first prompt should move from subject and scene into shot design, then describe how the camera changes over time. Structured prompting works because models handle temporally ordered, physically grounded instructions more reliably than dense paragraphs of mixed style language.[2]
Here's the simple template I use:
Subject + action + setting + shot size + angle + camera movement + lens/focus + lighting/style + time progression
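If you assemble prompts programmatically, the template maps naturally onto a small helper. This is a minimal sketch under my own conventions (the `build_prompt` function and its field names are illustrative, not part of any model's API):

```python
def build_prompt(subject, action, setting, shot_size, angle,
                 movement, lens_focus, lighting_style, progression):
    """Assemble a cinematography-first prompt from explicit fields.

    Keeping each part of the shot in its own slot stops vague style
    words from crowding out the camera instructions.
    """
    parts = [
        f"{subject} {action} {setting}.",
        f"{shot_size} at {angle}.",
        f"{movement}.",
        f"{lens_focus}, {lighting_style}.",
        f"{progression}.",
    ]
    return " ".join(parts)

prompt = build_prompt(
    subject="A woman in her 30s",
    action="stands alone on a rooftop",
    setting="at sunset, city skyline behind her",
    shot_size="Medium close-up",
    angle="eye level",
    movement="Slow dolly in as she turns toward camera",
    lens_focus="Shallow depth of field",
    lighting_style="warm rim light, subtle film grain",
    progression="Hold steady for two seconds, then rack focus to her face",
)
print(prompt)
```

The payoff isn't the code; it's that filling nine named slots forces you to make nine explicit decisions before you hit generate.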
And here's a clean before-and-after.
| Prompt type | Example |
|---|---|
| Before | "A cinematic scene of a woman in a city at sunset, dramatic and emotional." |
| After | "A woman in her 30s stands alone on a rooftop at sunset, city skyline behind her. Medium close-up at eye level. Slow dolly in as she turns toward camera. Shallow depth of field, warm rim light, soft key from frame left, subtle film grain. Hold steady for the first two seconds, then rack focus from skyline to her face." |
The second prompt is longer, but it's also more disciplined. Every phrase does a job.
That matches what I noticed in the research. The motion-injection work found that filmmaker-style prompts already help, but adding explicit motion primitives makes descriptions more directionally correct and temporally coherent.[2] In plain English: "tracking forward" is decent, but "dolly in while rolling clockwise" is better if that's what you actually want.
The highest-value cinematography terms are the ones that eliminate common ambiguities: dolly versus zoom, pan versus truck, high angle versus overhead, and close-up versus medium close-up. Learning these first gives you more control than memorizing dozens of fancy film-school labels.[1][2]
If you only learn twelve terms, make them these: static, pan, tilt, truck, dolly in, dolly out, crane up, crane down, wide shot, medium close-up, overhead shot, rack focus.
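One way to keep those twelve terms consistent across a team or a prompt library is to treat them as a small controlled vocabulary. A hedged sketch (the `CAMERA_TERMS` mapping and `terms_used` helper are my own; the definitions paraphrase standard film usage):

```python
# Controlled vocabulary: each term maps to the concrete camera
# behavior it implies, so everyone uses the word the same way.
CAMERA_TERMS = {
    "static": "camera does not move",
    "pan": "camera rotates horizontally in place",
    "tilt": "camera rotates vertically in place",
    "truck": "camera moves sideways through space",
    "dolly in": "camera moves physically closer to the subject",
    "dolly out": "camera moves physically away from the subject",
    "crane up": "camera rises vertically",
    "crane down": "camera lowers vertically",
    "wide shot": "subject small in frame, environment dominant",
    "medium close-up": "subject framed from roughly chest up",
    "overhead shot": "camera looks straight down, top-down geometry",
    "rack focus": "focus shifts from one plane to another mid-shot",
}

def terms_used(prompt):
    """Return which controlled-vocabulary terms appear in a prompt."""
    text = prompt.lower()
    return [term for term in CAMERA_TERMS if term in text]
```

For example, `terms_used("Slow dolly in, then rack focus to her face.")` returns `["dolly in", "rack focus"]`, which makes it easy to spot prompts that contain zero camera terms at all.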
Here's why. Researchers building precise video language found that even annotators often misuse common terms without strong guidelines.[1] And the camera-motion paper shows that models confuse similar motions when prompts stay vague.[2] So your job is not to sound sophisticated. Your job is to remove collisions.
A few practical distinctions:
A dolly in moves the camera physically closer. A zoom in changes focal length. Different look, different spatial feeling.
A pan rotates in place. A truck moves sideways through space.
A high angle looks down. An overhead or bird's-eye shot is much stricter and closer to top-down.[1]
A close-up is not the same as a medium close-up. If you care about hand motion, facial expression, or environmental context, that difference matters.
Better prompts for Veo, Sora, and Kling describe a shot the way a director would brief a cinematographer: specific subject, visible action, clear framing, one deliberate camera move, and concrete lighting. Practical prompting examples consistently work better when they avoid overloaded style jargon and conflicting motion cues.[1][2]
Here are three rewrites I'd actually use.
Rough prompt:
"A futuristic throne room, very cinematic."
Improved prompt:
"A grand futuristic throne room with floor-to-ceiling windows overlooking a glowing city. Wide shot, slightly canted angle. The central figure walks toward the throne as the camera dollies in slowly while rolling clockwise. Bright backlight from the windows, soft rim light on armor, symmetrical guards on both sides. End in a centered medium shot."
Rough prompt:
"A man wakes up in a forest, dramatic camera."
Improved prompt:
"A wounded man lies among wet ferns in a dim blue forest. Tight overhead close-up. The camera rolls slowly clockwise, then trucks left and cranes up to reveal more of the forest floor. Shallow depth of field at first, then deeper focus as a second figure enters frame from the left."
Rough prompt:
"A stylish product video for sneakers."
Improved prompt:
"White sneakers on a black pedestal in a dark studio. Medium shot with centered symmetrical framing. Static for one second, then slow arc clockwise around the shoes. Hard top light, glossy reflections, deep shadows, crisp specular highlights. Finish with a push into a close-up of the laces."
That last point matters for real workflows too. A community post on prompt engineering for video models made the same practical argument: useful prompts tend to follow a compact structure (subject, action, scene, camera, style), and vague phrases like "cinematic look" underperform compared with explicit lighting and framing language.[4] That's not a primary source, but it matches what the research is showing.
If you want a shortcut, this is exactly the kind of cleanup I'd automate with Rephrase: turn "make this feel like a moody opening shot" into a prompt with shot size, motion, focus, and lighting in one pass. For more workflows like this, the Rephrase blog is worth browsing.
The biggest mistake is stacking too many aesthetic words and too few camera instructions. Models can imitate style loosely, but they follow shots more reliably when framing, movement, and timing are explicit and non-contradictory.[1][2][3]
I see three repeat offenders.
First, people say "cinematic" instead of specifying composition. Second, they combine multiple incompatible motions in one short shot. Third, they confuse film terms that imply different geometry, like dolly and zoom.
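All three failure modes are mechanical enough to lint for before you spend a generation credit. A rough sketch, assuming my own word lists (illustrative, not exhaustive):

```python
# Style words that describe a mood without specifying a shot.
VAGUE_WORDS = {"cinematic", "dramatic", "beautiful", "epic", "stunning"}

# Motion pairs that pull a single short shot in contradictory directions.
CONFLICTING_MOTIONS = [
    ("dolly in", "dolly out"),
    ("crane up", "crane down"),
    ("pan left", "pan right"),
    ("zoom in", "zoom out"),
]

def lint_prompt(prompt):
    """Flag vague style words and contradictory camera motions."""
    text = prompt.lower()
    issues = []
    for word in sorted(VAGUE_WORDS):
        if word in text:
            issues.append(f"vague style word: '{word}' (specify framing instead)")
    for a, b in CONFLICTING_MOTIONS:
        if a in text and b in text:
            issues.append(f"conflicting motions: '{a}' and '{b}' in one shot")
    return issues
```

Running this on "A cinematic scene, dolly in then dolly out quickly" flags both the vague adjective and the motion collision, while a disciplined prompt like "Slow dolly in, shallow depth of field" passes clean.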
VERTIGO makes the same point from a different angle: better camera generation comes from evaluating framing, composition, and prompt adherence explicitly, not assuming plausible motion alone is enough.[3] In other words, a moving camera is not the same thing as a good shot.
So the habit I'd build is simple. Before you send a prompt, ask: can I sketch the shot from the words alone? If the answer is no, your model probably can't either.
The good news is that this skill compounds fast. Learn a dozen camera terms, use them consistently, and your prompts stop sounding like wishful thinking and start sounding like direction.
Documentation & Research

1. Building a Precise Video Language with Human-AI Oversight
2. Geometry-Guided Camera Motion Understanding in VideoLLMs
3. VERTIGO

Community Examples

4. Seedance 2.0 Prompt Engineering - r/PromptEngineering (link)
**Which cinematography terms are most useful in AI video prompts?**

The most useful terms describe shot size, camera movement, angle, lens behavior, focus, and framing. Terms like dolly in, pan left, medium close-up, Dutch angle, rack focus, and overhead shot reduce ambiguity fast.

**What's the difference between a dolly and a zoom?**

A dolly changes camera position in space, while a zoom changes focal length without moving the camera. Mixing them up often causes prompts to drift because they imply different visual outcomes.