Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
AI video models handle the easy stuff well: a glass shattering in slow motion, a neon-lit alley, a horse galloping on a beach. They fail at the hard stuff: a character in shot 1 who still looks the same in shot 2, a hand turning a door handle without growing a sixth finger, a lip-synced line whose mouth shapes match the phonemes. The mental model for advanced filmmaking is this: the tool is a beauty engine, and you must supply story, continuity, performance, and rhythm. Knowing where the tool's ceiling is saves you weeks of fruitless iteration.
Four concrete capability ceilings as of late 2025: (1) Character consistency — Veo 3, Kling 1.6, Runway Gen-4 all support reference images but degrade across >3 shots. (2) Lip sync — dedicated tools (Sync.so, LipDub, HeyGen) are better than the generalist video models. (3) Physical interaction — hands touching objects is still fragile; plan around it. (4) Long-duration shots — most models cap at 10 seconds; a 2-minute film = ~12-20 shots.
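The shot-count estimate in point (4) is simple division, and a minimal sketch makes it easy to redo for other running times. The 10-second cap comes from the text above; the 6-second "typical" shot length is an assumption chosen here because it reproduces the upper end of the 12-20 range, and the function name is hypothetical.

```python
import math

MAX_SHOT_SECONDS = 10     # per-shot cap quoted in the lesson; varies by model
TYPICAL_SHOT_SECONDS = 6  # assumption: an average usable shot length


def shot_count_range(film_seconds, max_shot=MAX_SHOT_SECONDS,
                     typical_shot=TYPICAL_SHOT_SECONDS):
    """Return (fewest, typical) shot counts for a film of the given length."""
    # Fewest shots: every shot runs to the model's duration cap.
    fewest = math.ceil(film_seconds / max_shot)
    # Typical count: shots average the shorter, assumed length.
    typical = math.ceil(film_seconds / typical_shot)
    return fewest, typical


print(shot_count_range(120))  # 2-minute film -> (12, 20)
```

Under the same assumptions, a 5-minute short comes out at 30 to 50 shots, which is worth knowing before committing to a schedule.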
# Shot feasibility checklist (ask before prompting)
[ ] Is there a recurring character? If yes → plan for consistency tools (ref images, LoRAs)
[ ] Is there dialogue? If yes → plan for a dedicated lip-sync pass
[ ] Is there hand-object interaction? If yes → compose the shot to hide or obscure it
[ ] Is the shot longer than 10s? If yes → split into two shots connected by a cut
[ ] Does the shot require specific blocking (character moves from X to Y)? → expect 5+ retries
[ ] Is there a fast-motion element (explosion, water)? → choose a model known for motion (Sora, Kling)
[ ] Is it mostly ambience (landscape, detail)? → any model will produce something usable in 1–2 tries
Use these three in order. Each builds on the one before.
In one paragraph, explain the current capability ceilings of AI video models — character consistency, lip sync, hand-object interaction, shot duration — and which are the hardest problems to work around.
Walk me through WHY AI video models fail at character consistency: what's happening inside the diffusion process, why reference conditioning is inherently imperfect, and what structural approaches (LoRAs, inpainting, frame-to-frame conditioning) help.
I'm pitching a 5-minute narrative short with a single recurring protagonist and 3 dialogue scenes. Walk me through the workflow that's realistic in 2026: which tools for which shots, where I need a dedicated lip-sync pass, and how to budget the 'fix it in post' hours into the schedule.
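For planning outside a chat session, the shot feasibility checklist from earlier in this lesson can also be sketched as a small Python function that is run over a shot list. This is only an illustration: the `Shot` fields and the `feasibility_notes` name are hypothetical, and treating a shot with no flags set as "mostly ambience" is an assumption, not something the checklist states.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Shot:
    """Hypothetical description of one planned shot."""
    recurring_character: bool = False
    dialogue: bool = False
    hand_object_interaction: bool = False
    duration_s: float = 5.0
    specific_blocking: bool = False
    fast_motion: bool = False


def feasibility_notes(shot: Shot) -> list[str]:
    """Return the checklist's planning notes that apply to this shot."""
    notes = []
    if shot.recurring_character:
        notes.append("plan for consistency tools (ref images, LoRAs)")
    if shot.dialogue:
        notes.append("plan for a dedicated lip-sync pass")
    if shot.hand_object_interaction:
        notes.append("compose the shot to hide or obscure the interaction")
    if shot.duration_s > 10:
        notes.append("split into two shots connected by a cut")
    if shot.specific_blocking:
        notes.append("expect 5+ retries")
    if shot.fast_motion:
        notes.append("choose a model known for motion")
    if not notes:
        # Assumption: nothing flagged means the shot is mostly ambience.
        notes.append("mostly ambience: usable in 1-2 tries on any model")
    return notes


# Example: a 14-second dialogue shot triggers two of the checks.
for note in feasibility_notes(Shot(dialogue=True, duration_s=14)):
    print("-", note)
```

Running this over every row of a shot list before prompting gives a rough count of how many shots need a lip-sync pass, a split, or extra retries.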