Learn how to use cinematography vocabulary in AI video prompts for Veo, Sora, and Kling so your shots land cleaner and feel more intentional. Try free.
Most AI video prompts fail for a boring reason: we ask for "cinematic," but we never say what the camera should actually do.
Cinematography vocabulary improves AI video prompts because it converts fuzzy creative intent into observable, time-based instructions. Recent research shows that models struggle with implicit camera understanding and respond better when motion, framing, and shot structure are described with explicit primitives and consistent terminology.[1][2]
Here's the core idea: film language is a compression format for visual intent. Directors don't tell a cinematographer, "make it cool." They say, "start wide, dolly in, hold eye level, then rack focus to the foreground." That language exists because it removes ambiguity. AI video models need the same thing.
This is not just taste. In *Building a Precise Video Language with Human-AI Oversight*, researchers argue that weak terminology leads to missing information, inconsistent captions, and misuse of terms like bird's-eye view, close-up, and zoom.[1] In *Geometry-Guided Camera Motion Understanding in VideoLLMs*, the authors show that models are notably weak at fine-grained camera motion recognition and improve when explicit motion labels are injected into prompts as a structured header.[2]
So if you're prompting Veo, Sora, or Kling, the win is simple: stop writing vibes first. Write camera behavior first.
A useful director's glossary for AI prompting should cover the controllable parts of a shot: shot size, angle, motion, framing, lens, focus, and pacing. These categories map better to how video models interpret scenes over time than broad adjectives like cinematic, dramatic, or beautiful.[1][3]
I like to think in seven buckets.
Shot size tells the model how close we are: extreme wide shot, wide shot, medium shot, medium close-up, close-up. This matters because a "close-up of trembling hands" is a totally different generation problem from "wide shot in a stormy parking lot."
Angle defines viewpoint: eye level, low angle, high angle, overhead, Dutch angle. This changes power, geometry, and subject emphasis.
Camera movement is where a lot of prompts break. Pan, tilt, truck, dolly, crane, roll, static. These are not interchangeable. Research-backed taxonomies treat them as distinct motion primitives for a reason.[2][3]
Lens and depth cover wide lens, telephoto, shallow depth of field, deep focus, fisheye distortion. These terms affect perceived space, not just style.
Focus behavior matters more than most people realize: rack focus, focus pull, foreground sharp / background soft. Recent work on precise video language calls out focus changes as details many datasets miss.[1]
Framing and screen position help stabilize composition: centered subject, off-center left, symmetrical framing, silhouette in background, subject in lower third. VERTIGO also evaluates prompts using composition dimensions like shot scale, shot angle, and screen position, which is a strong clue that these categories matter for control.[3]
Time structure helps the model stage the shot: opens with, midway through, then, finally. AI video often gets better when your prompt has sequence logic instead of one big blob.
A cinematography-first prompt should move from subject and scene into shot design, then describe how the camera changes over time. Structured prompting works because models handle temporally ordered, physically grounded instructions more reliably than dense paragraphs of mixed style language.[2]
Here's the simple template I use:
Subject + action + setting + shot size + angle + camera movement + lens/focus + lighting/style + time progression
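If you assemble prompts programmatically, the template maps naturally onto a small helper. This is a minimal sketch under my own conventions (the `build_prompt` function and its field names are illustrative, not part of any model's API):

```python
def build_prompt(subject, action, setting, shot_size, angle,
                 movement, lens_focus, lighting_style, progression):
    """Assemble a cinematography-first prompt from explicit fields.

    Keeping each part of the shot in its own slot stops vague style
    words from crowding out the camera instructions.
    """
    parts = [
        f"{subject} {action} {setting}.",
        f"{shot_size} at {angle}.",
        f"{movement}.",
        f"{lens_focus}, {lighting_style}.",
        f"{progression}.",
    ]
    return " ".join(parts)

prompt = build_prompt(
    subject="A woman in her 30s",
    action="stands alone on a rooftop",
    setting="at sunset, city skyline behind her",
    shot_size="Medium close-up",
    angle="eye level",
    movement="Slow dolly in as she turns toward camera",
    lens_focus="Shallow depth of field",
    lighting_style="warm rim light, subtle film grain",
    progression="Hold steady for two seconds, then rack focus to her face",
)
print(prompt)
```

The payoff isn't the code; it's that filling nine named slots forces you to make nine explicit decisions before you hit generate.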
And here's a clean before-and-after.
| Prompt type | Example |
|---|---|
| Before | "A cinematic scene of a woman in a city at sunset, dramatic and emotional." |
| After | "A woman in her 30s stands alone on a rooftop at sunset, city skyline behind her. Medium close-up at eye level. Slow dolly in as she turns toward camera. Shallow depth of field, warm rim light, soft key from frame left, subtle film grain. Hold steady for the first two seconds, then rack focus from skyline to her face." |
The second prompt is longer, but it's also more disciplined. Every phrase does a job.
That matches what I noticed in the research. The motion-injection work found that filmmaker-style prompts already help, but adding explicit motion primitives makes descriptions more directionally correct and temporally coherent.[2] In plain English: "tracking forward" is decent, but "dolly in while rolling clockwise" is better if that's what you actually want.
The highest-value cinematography terms are the ones that eliminate common ambiguities: dolly versus zoom, pan versus truck, high angle versus overhead, and close-up versus medium close-up. Learning these first gives you more control than memorizing dozens of fancy film-school labels.[1][2]
If you only learn twelve terms, make them these: static, pan, tilt, truck, dolly in, dolly out, crane up, crane down, wide shot, medium close-up, overhead shot, rack focus.
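One way to keep those twelve terms consistent across a team or a prompt library is to treat them as a small controlled vocabulary. A hedged sketch (the `CAMERA_TERMS` mapping and `terms_used` helper are my own; the definitions paraphrase standard film usage):

```python
# Controlled vocabulary: each term maps to the concrete camera
# behavior it implies, so everyone uses the word the same way.
CAMERA_TERMS = {
    "static": "camera does not move",
    "pan": "camera rotates horizontally in place",
    "tilt": "camera rotates vertically in place",
    "truck": "camera moves sideways through space",
    "dolly in": "camera moves physically closer to the subject",
    "dolly out": "camera moves physically away from the subject",
    "crane up": "camera rises vertically",
    "crane down": "camera lowers vertically",
    "wide shot": "subject small in frame, environment dominant",
    "medium close-up": "subject framed from roughly chest up",
    "overhead shot": "camera looks straight down, top-down geometry",
    "rack focus": "focus shifts from one plane to another mid-shot",
}

def terms_used(prompt):
    """Return which controlled-vocabulary terms appear in a prompt."""
    text = prompt.lower()
    return [term for term in CAMERA_TERMS if term in text]
```

For example, `terms_used("Slow dolly in, then rack focus to her face.")` returns `["dolly in", "rack focus"]`, which makes it easy to spot prompts that contain zero camera terms at all.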
Here's why. Researchers building precise video language found that even annotators often misuse common terms without strong guidelines.[1] And the camera-motion paper shows that models confuse similar motions when prompts stay vague.[2] So your job is not to sound sophisticated. Your job is to remove collisions.
A few practical distinctions:
A dolly in moves the camera physically closer. A zoom in changes focal length. Different look, different spatial feeling.
A pan rotates in place. A truck moves sideways through space.
A high angle looks down. An overhead or bird's-eye shot is much stricter and closer to top-down.[1]
A close-up is not the same as a medium close-up. If you care about hand motion, facial expression, or environmental context, that difference matters.
Better prompts for Veo, Sora, and Kling describe a shot the way a director would brief a cinematographer: specific subject, visible action, clear framing, one deliberate camera move, and concrete lighting. Practical prompting examples consistently work better when they avoid overloaded style jargon and conflicting motion cues.[1][2]
Here are three rewrites I'd actually use.
Rough prompt:
"A futuristic throne room, very cinematic."
Improved prompt:
"A grand futuristic throne room with floor-to-ceiling windows overlooking a glowing city. Wide shot, slightly canted angle. The central figure walks toward the throne as the camera dollies in slowly while rolling clockwise. Bright backlight from the windows, soft rim light on armor, symmetrical guards on both sides. End in a centered medium shot."
Rough prompt:
"A man wakes up in a forest, dramatic camera."
Improved prompt:
"A wounded man lies among wet ferns in a dim blue forest. Tight overhead close-up. The camera rolls slowly clockwise, then trucks left and cranes up to reveal more of the forest floor. Shallow depth of field at first, then deeper focus as a second figure enters frame from the left."
Rough prompt:
"A stylish product video for sneakers."
Improved prompt:
"White sneakers on a black pedestal in a dark studio. Medium shot with centered symmetrical framing. Static for one second, then slow arc clockwise around the shoes. Hard top light, glossy reflections, deep shadows, crisp specular highlights. Finish with a push into a close-up of the laces."
That last point matters for real workflows too. A community post on prompt engineering for video models made the same practical argument: useful prompts tend to follow a compact structure (subject, action, scene, camera, style), and vague phrases like "cinematic look" underperform compared with explicit lighting and framing language.[4] That's not a primary source, but it matches what the research is showing.
If you want a shortcut, this is exactly the kind of cleanup I'd automate with Rephrase: turn "make this feel like a moody opening shot" into a prompt with shot size, motion, focus, and lighting in one pass. For more workflows like this, the Rephrase blog is worth browsing.
The biggest mistake is stacking too many aesthetic words and too few camera instructions. Models can imitate style loosely, but they follow shots more reliably when framing, movement, and timing are explicit and non-contradictory.[1][2][3]
I see three repeat offenders.
First, people say "cinematic" instead of specifying composition. Second, they combine multiple incompatible motions in one short shot. Third, they confuse film terms that imply different geometry, like dolly and zoom.
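All three failure modes are mechanical enough to lint for before you spend a generation credit. A rough sketch, assuming my own word lists (illustrative, not exhaustive):

```python
# Style words that describe a mood without specifying a shot.
VAGUE_WORDS = {"cinematic", "dramatic", "beautiful", "epic", "stunning"}

# Motion pairs that pull a single short shot in contradictory directions.
CONFLICTING_MOTIONS = [
    ("dolly in", "dolly out"),
    ("crane up", "crane down"),
    ("pan left", "pan right"),
    ("zoom in", "zoom out"),
]

def lint_prompt(prompt):
    """Flag vague style words and contradictory camera motions."""
    text = prompt.lower()
    issues = []
    for word in sorted(VAGUE_WORDS):
        if word in text:
            issues.append(f"vague style word: '{word}' (specify framing instead)")
    for a, b in CONFLICTING_MOTIONS:
        if a in text and b in text:
            issues.append(f"conflicting motions: '{a}' and '{b}' in one shot")
    return issues
```

Running this on "A cinematic scene, dolly in then dolly out quickly" flags both the vague adjective and the motion collision, while a disciplined prompt like "Slow dolly in, shallow depth of field" passes clean.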
VERTIGO makes the same point from a different angle: better camera generation comes from evaluating framing, composition, and prompt adherence explicitly, not assuming plausible motion alone is enough.[3] In other words, a moving camera is not the same thing as a good shot.
So the habit I'd build is simple. Before you send a prompt, ask: can I sketch the shot from the words alone? If the answer is no, your model probably can't either.
The good news is that this skill compounds fast. Learn a dozen camera terms, use them consistently, and your prompts stop sounding like wishful thinking and start sounding like direction.
Documentation & Research

1. Building a Precise Video Language with Human-AI Oversight
2. Geometry-Guided Camera Motion Understanding in VideoLLMs
3. VERTIGO

Community Examples

4. Seedance 2.0 Prompt Engineering - r/PromptEngineering (link)
**Which cinematography terms are most useful in AI video prompts?**

The most useful terms describe shot size, camera movement, angle, lens behavior, focus, and framing. Terms like dolly in, pan left, medium close-up, Dutch angle, rack focus, and overhead shot reduce ambiguity fast.

**What's the difference between a dolly and a zoom?**

A dolly changes camera position in space, while a zoom changes focal length without moving the camera. Mixing them up often causes prompts to drift because they imply different visual outcomes.