Learn how Veo 3.1 shifts video prompting from text-only guesswork to reference-driven workflows, with examples and a practical playbook.
Most AI video prompting used to feel like whispering to a black box. You kept adding adjectives, hoping the model would finally "see" the thing in your head.
Reference images are replacing prompt whispering because they encode visual specifics more reliably than language. Research on reference-guided video editing shows text-only instructions often fail at exact identity, texture, and style control, while image references improve fidelity and reduce ambiguity in the generation process [2].
Here's the big shift. In older video prompting, we tried to make text do everything. Character design. Wardrobe. Lens choice. Mood. Background details. Action. Continuity. That worked just enough to become a habit, but not enough to become a workflow.
What's interesting in the newer Veo 3.1 conversation is that the workflow itself has changed. Google's recent Veo 3.1 rollout on Vertex AI frames model choice around production use cases and iteration speed, not just raw prompt craft [1]. And outside the official material, the clearest practical pattern is that users are leaning on reference images to lock the hard parts first, then using text to steer the shot.
That lines up with the research. The Kiwi-Edit paper makes the core point bluntly: natural language is inherently limited when you need exact visual details, specific object identity, or nuanced stylistic characteristics [2]. In plain English, text is bad at being a moodboard.
The ingredients-to-video workflow means you stop writing one overloaded prompt and instead assemble a small set of inputs with clear jobs. Images define what things should look like, while the prompt defines what should happen, how the camera behaves, and what must stay consistent.
I think of it like this: text is your direction, images are your evidence.
Instead of saying, "Create a stylish woman in a red raincoat walking through a neon Tokyo alley at night with cinematic lighting and realistic reflections," you can provide a character reference, a wardrobe/style reference, and an environment reference. Then the text becomes shorter and sharper: what she does, how the camera moves, and what mood the clip should sustain.
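To make that concrete, here's a minimal sketch of the ingredients-to-video pattern using the google-genai Python SDK. The model ID, the `reference_images` config field, the `VideoGenerationReferenceImage` type, and the file paths are assumptions based on the Veo 3.1 preview docs, so check the current SDK reference before relying on exact names.

```python
# Sketch: images carry appearance, text carries motion/camera/constraints.
import time

from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# Images do the "what it looks like" job: identity, wardrobe, environment.
# (reference_type "asset" is assumed here to mean a subject/identity reference.)
references = [
    types.VideoGenerationReferenceImage(
        image=types.Image.from_file(location=path),
        reference_type="asset",
    )
    for path in ["character.png", "raincoat_style.png", "alley_env.png"]
]

# Text does only the director's-note job: motion, camera, constraints, mood.
prompt = (
    "Medium tracking shot as she walks calmly toward camera through light "
    "rain. Slow dolly backward, shallow depth of field. Keep the red coat "
    "silhouette consistent. Tense but controlled mood."
)

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",  # assumed preview model ID
    prompt=prompt,
    config=types.GenerateVideosConfig(reference_images=references),
)

# Video generation is asynchronous; poll the operation until it completes.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0].video
client.files.download(file=video)
video.save("alley_shot.mp4")
```

Notice how short the prompt is: everything about appearance lives in the three reference images, so the text never has to compete with them.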
That's also consistent with how multimodal video systems are evolving more broadly. The PrevizWhiz paper shows a production workflow where creators start with structured visual scaffolds, then add text and motion guidance on top [3]. Different system, same idea: visual structure first, language second.
Here's the comparison I keep coming back to:
| Workflow | What text has to do | Result |
|---|---|---|
| Text-only prompting | Describe appearance, style, setting, action, and camera all at once | Higher ambiguity, more drift |
| Ingredients-to-video | Describe motion, framing, timing, and constraints while references carry appearance | Better consistency and easier iteration |
That's why "prompt whispering" starts to feel outdated. You're no longer charming the model with prose. You're briefing it with assets.
You should write Veo 3.1 prompts as shot directions, not as full visual descriptions. Once references are provided, the prompt should focus on motion, camera, sequence, and constraints, because those are the parts language still handles best in a multimodal workflow [2][3].
Here's what I noticed from both the research and real user behavior: the more visual ingredients you provide, the more your prompt should sound like a director's note.
That means prioritizing things like: camera move, pacing, subject action, emotional beat, transition, and what must remain stable.
It also means dropping a lot of prompt clutter. If the character reference already shows the face, jacket, and silhouette, you do not need to repeat every detail in text unless it's a constraint. Repetition often creates conflict instead of clarity.
**Before**

> A beautiful cinematic woman with short dark hair and a glossy red raincoat walks through a rainy neon alley in Tokyo at night, realistic reflections, stylish lighting, moody atmosphere, ultra detailed, high quality, dramatic scene, cyberpunk feeling

**After**

> Use the reference images for character identity, wardrobe, and alley environment. Medium tracking shot as she walks calmly toward camera through light rain. Neon reflections ripple on the pavement. Keep the red coat silhouette consistent. Slow dolly backward, shallow depth of field, grounded realistic motion, tense but controlled mood.
The second prompt is better because it stops competing with the reference inputs. It gives the model instructions, not redundant description.
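If you write these often, it can help to treat the shot direction as a structure rather than a paragraph. Here's a hypothetical helper (not part of any Veo SDK; the field names are my own) that assembles a prompt from named slots, so each iteration changes one field instead of rewording everything:

```python
# Hypothetical prompt builder: one named slot per job the text still owns.
from dataclasses import dataclass


@dataclass
class ShotDirection:
    action: str       # what the subject does
    camera: str       # framing and movement
    constraints: str  # what must stay consistent across frames
    mood: str         # emotional register of the clip

    def render(self) -> str:
        return (
            "Use the reference images for character identity, wardrobe, "
            f"and environment. {self.action} {self.camera} "
            f"{self.constraints} {self.mood}"
        )


shot = ShotDirection(
    action="She walks calmly toward camera through light rain.",
    camera="Medium tracking shot, slow dolly backward, shallow depth of field.",
    constraints="Keep the red coat silhouette consistent.",
    mood="Tense but controlled mood, grounded realistic motion.",
)
print(shot.render())
```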
If you do this often, this is exactly where Rephrase becomes useful. You can dump rough creative notes, hit a hotkey, and turn them into a cleaner shot-direction prompt without manually rewriting every sentence.
This workflow improves consistency because multimodal systems preserve visual anchors better when appearance is carried by explicit references instead of inferred from ambiguous language. Research on reference-guided editing and controllable video workflows repeatedly shows that external visual grounding improves identity preservation, style adherence, and scene coherence [2][3].
The catch is that consistency in video is not just about getting frame one right. It's about keeping the same logic alive across frames.
That's where visual anchors matter. The Demystifying Video Reasoning paper argues that video models rely on persistent internal reference through the denoising process, including a kind of working memory [4]. If that's true, then giving the model stronger visual anchors up front should make the whole generation path less fragile. You're reducing the amount of guessing the model has to do at every step.
This doesn't mean text is dead. It means text gets promoted to a better job.
Use images for: identity, wardrobe, product appearance, background look, and aesthetic style.
Use text for: action, tempo, framing, lens feel, continuity constraints, and emotional direction.
That division of labor is cleaner. And cleaner workflows usually win.
A practical Veo 3.1 workflow starts by separating appearance from behavior. Gather visual references first, then write a short prompt that only covers movement, camera, and constraints, and iterate by swapping one ingredient at a time instead of rewriting everything.
Here's the simple process I'd use:

1. Gather your visual references first: character, wardrobe or product, environment, and style.
2. Write a short shot-direction prompt that covers only motion, camera, and constraints.
3. Generate, review, and iterate by swapping one ingredient at a time instead of rewriting everything (sketched in code below).
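The iteration step is the easiest to get wrong, so here's a minimal sketch of one-variable iteration: hold the prompt and all but one ingredient fixed, and swap candidates for a single slot. The `generate()` function is a stand-in for whatever client you use (for example, the `generate_videos` sketch earlier); the slot names and file paths are illustrative.

```python
# Sketch: swap one ingredient at a time so any output change is attributable.
base_ingredients = {
    "character": "character.png",
    "wardrobe": "raincoat_style.png",
    "environment": "alley_env.png",
}
prompt = "Medium tracking shot as she walks toward camera through light rain."


def generate(prompt: str, ingredients: dict[str, str]) -> None:
    # Stand-in: call your video client here with these exact inputs.
    print(f"generating with {ingredients}")


# Only the environment reference varies; character and wardrobe stay fixed.
for candidate in ["alley_env.png", "alley_env_wet.png", "alley_env_dawn.png"]:
    trial = {**base_ingredients, "environment": candidate}
    generate(prompt, trial)
```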
A community example from r/PromptEngineering describes using Veo for landing-page assets and explicitly calls out reference images as the way to keep outputs aligned with a product UI or brand aesthetic [5]. That's not proof by itself, but it matches what the stronger sources are already telling us.
The real upgrade here is not "better prompting." It's better task allocation.
Prompting isn't disappearing in Veo 3.1. It's getting demoted from magician to coordinator. And honestly, that's a good thing.
When you stop forcing text to describe every pixel, your prompts get shorter, your iterations get faster, and your outputs get closer to intent. If you're still writing giant cinematic paragraphs, try the ingredients approach on your next shot. Then, if you want to automate the cleanup step, use Rephrase to turn your rough notes into a tighter prompt in a couple of seconds.
Documentation & Research
Community Examples 5. Stop paying for B-roll: I made a free guide on using Google Veo to generate video assets for your projects - r/PromptEngineering (link)
**What is the ingredients-to-video workflow?**
It's a reference-first workflow where you assemble visual ingredients like character looks, environments, and style cues, then use text to describe motion, camera, and sequencing. The goal is to reduce the ambiguity that plain text prompts often create.
**Can you still get good results with text-only prompts?**
Yes, but text-only prompting is less reliable when you need repeatable characters, exact art direction, or product-specific visuals. Reference-guided workflows usually produce more controllable outputs.