Learn how Veo 3.1 shifts video prompting from text-only guesswork to reference-driven workflows, with examples and a practical playbook.
Most AI video prompting used to feel like whispering to a black box. You kept adding adjectives, hoping the model would finally "see" the thing in your head.
Reference images are replacing prompt whispering because they encode visual specifics more reliably than language. Research on reference-guided video editing shows text-only instructions often fail at exact identity, texture, and style control, while image references improve fidelity and reduce ambiguity in the generation process [2].
Here's the big shift. In older video prompting, we tried to make text do everything. Character design. Wardrobe. Lens choice. Mood. Background details. Action. Continuity. That worked just enough to become a habit, but not enough to become a workflow.
What's interesting in the newer Veo 3.1 conversation is that the workflow itself has changed. Google's recent Veo 3.1 rollout on Vertex AI frames model choice around production use cases and iteration speed, not just raw prompt craft [1]. And outside the official material, the clearest practical pattern is that users are leaning on reference images to lock the hard parts first, then using text to steer the shot.
That lines up with the research. The Kiwi-Edit paper makes the core point bluntly: natural language is inherently limited when you need exact visual details, specific object identity, or nuanced stylistic characteristics [2]. In plain English, text is bad at being a moodboard.
The ingredients-to-video workflow means you stop writing one overloaded prompt and instead assemble a small set of inputs with clear jobs. Images define what things should look like, while the prompt defines what should happen, how the camera behaves, and what must stay consistent.
I think of it like this: text is your direction, images are your evidence.
Instead of saying, "Create a stylish woman in a red raincoat walking through a neon Tokyo alley at night with cinematic lighting and realistic reflections," you can provide a character reference, a wardrobe/style reference, and an environment reference. Then the text becomes shorter and sharper: what she does, how the camera moves, and what mood the clip should sustain.
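To make that concrete, here's a minimal sketch of the ingredients-to-video pattern using the google-genai Python SDK. The model ID, the `reference_images` config field, the `VideoGenerationReferenceImage` type, and the file paths are assumptions based on the Veo 3.1 preview docs, so check the current SDK reference before relying on exact names.

```python
# Sketch: images carry appearance, text carries motion/camera/constraints.
import time

from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# Images do the "what it looks like" job: identity, wardrobe, environment.
# (reference_type "asset" is assumed here to mean a subject/identity reference.)
references = [
    types.VideoGenerationReferenceImage(
        image=types.Image.from_file(location=path),
        reference_type="asset",
    )
    for path in ["character.png", "raincoat_style.png", "alley_env.png"]
]

# Text does only the director's-note job: motion, camera, constraints, mood.
prompt = (
    "Medium tracking shot as she walks calmly toward camera through light "
    "rain. Slow dolly backward, shallow depth of field. Keep the red coat "
    "silhouette consistent. Tense but controlled mood."
)

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",  # assumed preview model ID
    prompt=prompt,
    config=types.GenerateVideosConfig(reference_images=references),
)

# Video generation is asynchronous; poll the operation until it completes.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0].video
client.files.download(file=video)
video.save("alley_shot.mp4")
```

Notice how short the prompt is: everything about appearance lives in the three reference images, so the text never has to compete with them.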
That's also consistent with how multimodal video systems are evolving more broadly. The PrevizWhiz paper shows a production workflow where creators start with structured visual scaffolds, then add text and motion guidance on top [3]. Different system, same idea: visual structure first, language second.
Here's the comparison I keep coming back to:
| Workflow | What text has to do | Result |
|---|---|---|
| Text-only prompting | Describe appearance, style, setting, action, and camera all at once | Higher ambiguity, more drift |
| Ingredients-to-video | Describe motion, framing, timing, and constraints while references carry appearance | Better consistency and easier iteration |
That's why "prompt whispering" starts to feel outdated. You're no longer charming the model with prose. You're briefing it with assets.
You should write Veo 3.1 prompts as shot directions, not as full visual descriptions. Once references are provided, the prompt should focus on motion, camera, sequence, and constraints, because those are the parts language still handles best in a multimodal workflow [2][3].
Here's what I noticed from both the research and real user behavior: the more visual ingredients you provide, the more your prompt should sound like a director's note.
That means prioritizing things like: camera move, pacing, subject action, emotional beat, transition, and what must remain stable.
It also means dropping a lot of prompt clutter. If the character reference already shows the face, jacket, and silhouette, you do not need to repeat every detail in text unless it's a constraint. Repetition often creates conflict instead of clarity.
**Before**

> A beautiful cinematic woman with short dark hair and a glossy red raincoat walks through a rainy neon alley in Tokyo at night, realistic reflections, stylish lighting, moody atmosphere, ultra detailed, high quality, dramatic scene, cyberpunk feeling

**After**

> Use the reference images for character identity, wardrobe, and alley environment. Medium tracking shot as she walks calmly toward camera through light rain. Neon reflections ripple on the pavement. Keep the red coat silhouette consistent. Slow dolly backward, shallow depth of field, grounded realistic motion, tense but controlled mood.
The second prompt is better because it stops competing with the reference inputs. It gives the model instructions, not redundant description.
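If you write these often, it can help to treat the shot direction as a structure rather than a paragraph. Here's a hypothetical helper (not part of any Veo SDK; the field names are my own) that assembles a prompt from named slots, so each iteration changes one field instead of rewording everything:

```python
# Hypothetical prompt builder: one named slot per job the text still owns.
from dataclasses import dataclass


@dataclass
class ShotDirection:
    action: str       # what the subject does
    camera: str       # framing and movement
    constraints: str  # what must stay consistent across frames
    mood: str         # emotional register of the clip

    def render(self) -> str:
        return (
            "Use the reference images for character identity, wardrobe, "
            f"and environment. {self.action} {self.camera} "
            f"{self.constraints} {self.mood}"
        )


shot = ShotDirection(
    action="She walks calmly toward camera through light rain.",
    camera="Medium tracking shot, slow dolly backward, shallow depth of field.",
    constraints="Keep the red coat silhouette consistent.",
    mood="Tense but controlled mood, grounded realistic motion.",
)
print(shot.render())
```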
If you do this often, this is exactly where Rephrase becomes useful. You can dump rough creative notes, hit a hotkey, and turn them into a cleaner shot-direction prompt without manually rewriting every sentence.
This workflow improves consistency because multimodal systems preserve visual anchors better when appearance is carried by explicit references instead of inferred from ambiguous language. Research on reference-guided editing and controllable video workflows repeatedly shows that external visual grounding improves identity preservation, style adherence, and scene coherence [2][3].
The catch is that consistency in video is not just about getting frame one right. It's about keeping the same logic alive across frames.
That's where visual anchors matter. The Demystifying Video Reasoning paper argues that video models rely on persistent internal reference through the denoising process, including a kind of working memory [4]. If that's true, then giving the model stronger visual anchors up front should make the whole generation path less fragile. You're reducing the amount of guessing the model has to do at every step.
This doesn't mean text is dead. It means text gets promoted to a better job.
Use images for: identity, wardrobe, product appearance, background look, and aesthetic style.
Use text for: action, tempo, framing, lens feel, continuity constraints, and emotional direction.
That division of labor is cleaner. And cleaner workflows usually win.
A practical Veo 3.1 workflow starts by separating appearance from behavior. Gather visual references first, then write a short prompt that only covers movement, camera, and constraints, and iterate by swapping one ingredient at a time instead of rewriting everything.
Here's the simple process I'd use:

1. Gather your visual references first: character, wardrobe or product, environment, and style.
2. Write a short shot-direction prompt that covers only motion, camera, and constraints.
3. Generate, review, and iterate by swapping one ingredient at a time instead of rewriting everything (sketched in code below).
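The iteration step is the easiest to get wrong, so here's a minimal sketch of one-variable iteration: hold the prompt and all but one ingredient fixed, and swap candidates for a single slot. The `generate()` function is a stand-in for whatever client you use (for example, the `generate_videos` sketch earlier); the slot names and file paths are illustrative.

```python
# Sketch: swap one ingredient at a time so any output change is attributable.
base_ingredients = {
    "character": "character.png",
    "wardrobe": "raincoat_style.png",
    "environment": "alley_env.png",
}
prompt = "Medium tracking shot as she walks toward camera through light rain."


def generate(prompt: str, ingredients: dict[str, str]) -> None:
    # Stand-in: call your video client here with these exact inputs.
    print(f"generating with {ingredients}")


# Only the environment reference varies; character and wardrobe stay fixed.
for candidate in ["alley_env.png", "alley_env_wet.png", "alley_env_dawn.png"]:
    trial = {**base_ingredients, "environment": candidate}
    generate(prompt, trial)
```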
A community example from r/PromptEngineering describes using Veo for landing-page assets and explicitly calls out reference images as the way to keep outputs aligned with a product UI or brand aesthetic [5]. That's not proof by itself, but it matches what the stronger sources are already telling us.
The real upgrade here is not "better prompting." It's better task allocation.
Prompting isn't disappearing in Veo 3.1. It's getting demoted from magician to coordinator. And honestly, that's a good thing.
When you stop forcing text to describe every pixel, your prompts get shorter, your iterations get faster, and your outputs get closer to intent. If you're still writing giant cinematic paragraphs, try the ingredients approach on your next shot. Then, if you want to automate the cleanup step, use Rephrase to turn your rough notes into a tighter prompt in a couple of seconds.
Documentation & Research
Community Examples 5. Stop paying for B-roll: I made a free guide on using Google Veo to generate video assets for your projects - r/PromptEngineering (link)
**What is the ingredients-to-video workflow?**
It's a reference-first workflow where you assemble visual ingredients like character looks, environments, and style cues, then use text to describe motion, camera, and sequencing. The goal is to reduce the ambiguity that plain text prompts often create.
**Can you still get good results with text-only prompts?**
Yes, but text-only prompting is less reliable when you need repeatable characters, exact art direction, or product-specific visuals. Reference-guided workflows usually produce more controllable outputs.