The new constraints — AI models are great at beauty, bad at story

easy


Why this matters

AI video models nail the easy stuff: a glass shattering in slow motion, a neon-lit alley, a horse galloping on a beach. They fail the hard stuff: a character in shot 1 who still looks the same in shot 2, a hand turning a door handle without growing a sixth finger, a lip-synced line whose mouth shapes match the phonemes. The mental model for advanced filmmaking is: the tool is a beauty engine. You must supply story, continuity, performance, and rhythm. Knowing where the tool's ceiling is saves you weeks of fruitless iteration.

Demo

Four concrete capability ceilings as of late 2025: (1) Character consistency: Veo 3, Kling 1.6, and Runway Gen-4 all support reference images, but consistency degrades across more than 3 shots. (2) Lip sync: dedicated tools (Sync.so, LipDub, HeyGen) outperform the generalist video models. (3) Physical interaction: hands touching objects are still fragile; plan around them. (4) Long-duration shots: most models cap at 10 seconds, so a 2-minute film needs roughly 12-20 shots.
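The shot-count estimate in point (4) is simple arithmetic, and a minimal sketch makes it easy to rerun for your own runtime. The function name `shots_needed` and the 6-second average are illustrative assumptions; the 10-second cap is the figure quoted above and may differ for your tool.

```python
import math


def shots_needed(film_seconds: float, avg_shot_seconds: float,
                 max_shot_seconds: float = 10.0) -> int:
    """Estimate how many shots cover a film when each shot is capped in length.

    max_shot_seconds defaults to the 10 s cap quoted in the lesson (an assumption;
    check the limit of the tool you actually use).
    """
    if avg_shot_seconds <= 0 or avg_shot_seconds > max_shot_seconds:
        raise ValueError("average shot length must be in (0, max_shot_seconds]")
    # Round up: a partial shot is still a shot you have to generate.
    return math.ceil(film_seconds / avg_shot_seconds)


print(shots_needed(120, 10))  # every shot at the 10 s cap -> 12
print(shots_needed(120, 6))   # shots averaging 6 s -> 20
```

The two calls reproduce the 12-20 range for a 2-minute film: 12 is the floor if every shot runs the full 10 seconds, and the count climbs as the average shot gets shorter.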

# Shot feasibility checklist (ask before prompting)
[ ] Is there a recurring character? If yes → plan for consistency tools (ref images, LoRAs)
[ ] Is there dialogue? If yes → plan for a dedicated lip-sync pass
[ ] Is there hand-object interaction? If yes → compose the shot to hide or obscure it
[ ] Is the shot longer than 10s? If yes → split into two shots connected by a cut
[ ] Does the shot require specific blocking (character moves from X to Y)? → expect 5+ retries
[ ] Is there a fast-motion element (explosion, water)? → choose a model known for motion (Sora, Kling)
[ ] Is it mostly ambience (landscape, detail)? → any model will produce something usable in 1–2 tries
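
If you track shots in a spreadsheet or script, the checklist above can be sketched as a small function. This is a hypothetical helper, not part of any tool: the `Shot` fields and `feasibility_notes` are names invented for illustration, and treating a shot with no flags as "mostly ambience" is an assumption.

```python
from dataclasses import dataclass

MAX_SHOT_SECONDS = 10.0  # cap quoted in the lesson; adjust for your tool


@dataclass
class Shot:
    """One planned shot, described by the checklist questions (hypothetical schema)."""
    name: str
    duration_s: float
    recurring_character: bool = False
    dialogue: bool = False
    hand_object_interaction: bool = False
    specific_blocking: bool = False
    fast_motion: bool = False


def feasibility_notes(shot: Shot) -> list[str]:
    """Return the planning notes the checklist triggers for this shot."""
    notes = []
    if shot.recurring_character:
        notes.append("plan for consistency tools (ref images, LoRAs)")
    if shot.dialogue:
        notes.append("plan for a dedicated lip-sync pass")
    if shot.hand_object_interaction:
        notes.append("compose the shot to hide or obscure the interaction")
    if shot.duration_s > MAX_SHOT_SECONDS:
        notes.append("split into two shots connected by a cut")
    if shot.specific_blocking:
        notes.append("expect 5+ retries")
    if shot.fast_motion:
        notes.append("choose a model known for motion")
    if not notes:
        # Assumption: no flags raised means the shot is mostly ambience.
        notes.append("mostly ambience: usable in 1-2 tries")
    return notes


shot = Shot("kitchen argument", duration_s=14, recurring_character=True, dialogue=True)
for note in feasibility_notes(shot):
    print("-", note)
```

Running every shot in a script breakdown through a function like this gives a rough count of how many shots need special handling before you write a single prompt.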

Try it yourself

Pick a 5-minute video you admire. Break it into shots. For each shot, rate its AI feasibility: trivial / moderate / very hard. Notice how hard even 'simple' shots get when they need continuity.

Try to generate the same character across three different shots in three models (Runway, Kling, Luma). Record the drift. This is your budget reality for multi-shot narratives today.

Attempt a simple hand-object interaction shot (character picking up a cup). Count the iterations until the result is acceptable. Record how well the model handles hands for your capability reference.

Generate a 10-second shot in the highest-detail mode your tool allows. Compare its cost per second against 5-second shots. Pay attention to where extra shot length stops buying quality.
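
One way to make the cost comparison concrete is to fold retries into the price, since a cheaper clip that needs more attempts can cost more per usable second. The sketch below assumes a flat price per generation; the function name and all prices and attempt counts are placeholders, not real pricing for any tool.

```python
def cost_per_usable_second(price_per_generation: float, clip_seconds: float,
                           attempts: int) -> float:
    """Total spend divided by the seconds of footage you actually keep.

    Assumes a flat price per generation and that only the final attempt is used.
    """
    if clip_seconds <= 0 or attempts < 1:
        raise ValueError("clip_seconds must be positive and attempts at least 1")
    return price_per_generation * attempts / clip_seconds


# Placeholder numbers: substitute your own prices and retry counts.
short_clip = cost_per_usable_second(price_per_generation=0.50, clip_seconds=5, attempts=2)
long_clip = cost_per_usable_second(price_per_generation=1.00, clip_seconds=10, attempts=4)
print(f"5 s clips:  {short_clip:.2f} per usable second")
print(f"10 s clips: {long_clip:.2f} per usable second")
```

With these made-up inputs the longer clip comes out at twice the cost per usable second, purely because it needed more retries. Your own numbers may point the other way.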

Write a one-page 'capability memo' for your team: what each tool is good/bad at this quarter. Update it monthly — the landscape moves fast.

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain the current capability ceilings of AI video models — character consistency, lip sync, hand-object interaction, shot duration — and which are the hardest problems to work around.

2. Why it works (the mechanism)

Walk me through WHY AI video models fail at character consistency: what's happening inside the diffusion process, why reference conditioning is inherently imperfect, and what structural approaches (LoRAs, inpainting, frame-to-frame conditioning) help.

3. Advanced — application & what's next

I'm pitching a 5-minute narrative short with a single recurring protagonist and 3 dialogue scenes. Walk me through the workflow that's realistic in 2026: which tools for which shots, where I need a dedicated lip-sync pass, and how to budget the post-production fix-up hours into the schedule.