Anatomy of a video prompt

easy

Why this matters

A good video prompt is five decisions, not a sentence. You name the subject, the setting, the style, the camera, and the lighting. If you leave any of those blank, the model fills it in — and its defaults are probably not what you wanted. Learning to name all five up front is the difference between 'rolling dice' and 'directing'.

Demo

Structured prompts outperform thin ones because they supply the five decisions a model needs to stop guessing: subject, setting, style, camera, and lighting. Without those labels, the model fills the blanks from its training distribution, which is rarely the aesthetic you had in mind. The GOOD prompt below also adds a sixth line, Motion, to say what changes during the clip. The pair shows how each added label narrows the output space toward what you actually want.

BAD:
A cat in a kitchen.
 
GOOD:
Subject: a ginger cat, tail flicking
Setting: a sunlit 1970s kitchen, dust in the air
Style: 35mm film, warm grain, slight halation
Camera: eye-level medium shot, shallow depth of field
Lighting: late-afternoon side-light from a single window
Motion: the cat turns its head toward the camera once, slowly
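
One way to keep labels from being dropped by accident is to assemble the prompt from named fields. The sketch below is a minimal Python illustration and not part of any video tool: `LABELS`, `missing_labels`, and `build_prompt` are hypothetical names, and the result is only the text you would paste into whichever generator you use.

```python
# Hypothetical helpers: the label names come from the GOOD prompt above;
# the function names and dict layout are assumptions for illustration.
LABELS = ["Subject", "Setting", "Style", "Camera", "Lighting", "Motion"]


def missing_labels(fields):
    """Return the labels that are absent or blank, in demo order."""
    return [name for name in LABELS if not fields.get(name, "").strip()]


def build_prompt(fields):
    """Join the filled-in labels into 'Label: value' lines, in demo order."""
    return "\n".join(
        f"{name}: {fields[name].strip()}"
        for name in LABELS
        if fields.get(name, "").strip()
    )


good = {
    "Subject": "a ginger cat, tail flicking",
    "Setting": "a sunlit 1970s kitchen, dust in the air",
    "Style": "35mm film, warm grain, slight halation",
    "Camera": "eye-level medium shot, shallow depth of field",
    "Lighting": "late-afternoon side-light from a single window",
    "Motion": "the cat turns its head toward the camera once, slowly",
}
thin = {"Subject": "a cat", "Setting": "a kitchen"}

print(build_prompt(good))
print("Left to the model:", missing_labels(thin))
```

Run against the thin prompt, `missing_labels` lists every decision the model would have to make for you.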

Try it yourself

Take any one-line prompt you've written before. Rewrite it with the six labels from the demo (Subject / Setting / Style / Camera / Lighting / Motion).
Generate one clip from the thin prompt and one from the structured version. Note which surprised you more.
Change only the 'Style' line (e.g. swap '35mm film' for 'Pixar 3D'). Keep everything else. See how much of the output changes.
Remove one label at a time and regenerate. Which label, when missing, causes the biggest drift?
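
The last two exercises amount to a small ablation: change or drop one label while holding the others fixed. A hedged Python sketch of how you might produce those prompt variants follows. `BASE`, `render`, `swap`, and `drop_one` are hypothetical names, and nothing here calls a video model; it only prints the text to paste in.

```python
# Hypothetical sketch: builds prompt variants as plain text only.
BASE = {
    "Subject": "a ginger cat, tail flicking",
    "Setting": "a sunlit 1970s kitchen, dust in the air",
    "Style": "35mm film, warm grain, slight halation",
    "Camera": "eye-level medium shot, shallow depth of field",
    "Lighting": "late-afternoon side-light from a single window",
    "Motion": "the cat turns its head toward the camera once, slowly",
}


def render(fields):
    """Turn a label -> value mapping into 'Label: value' lines."""
    return "\n".join(f"{label}: {value}" for label, value in fields.items())


def swap(fields, label, value):
    """Return a prompt with one label replaced and the rest unchanged."""
    return render({**fields, label: value})


def drop_one(fields):
    """Yield (dropped_label, prompt) pairs, each missing exactly one label."""
    for label in fields:
        rest = {k: v for k, v in fields.items() if k != label}
        yield label, render(rest)


style_variant = swap(BASE, "Style", "Pixar 3D")
ablations = dict(drop_one(BASE))

print(style_variant)
for label, prompt in ablations.items():
    print(f"--- without {label} ---")
    print(prompt)
```

Generating one clip per variant and comparing each against the full prompt gives a like-for-like view of how much each label contributes.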

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Explain in one short paragraph what each part of a structured video prompt does: Subject, Setting, Style, Camera, Lighting, Motion. Use an example where skipping one of them would ruin the shot.

2. Why it works (the mechanism)

Video generation models are trained on captioned clips. Walk me through why a prompt that names Style and Camera produces a more consistent result than a prompt that only names Subject and Setting. What exactly is the model 'looking up' when I write '35mm film, shallow depth of field'?