Voiceover, dialogue, and interview — treating language as craft

Difficulty: hard

Why this matters

Most AI-generated ads are ruined by the voice, not the visuals. A perfect-looking hero paired with a robotic narration collapses the whole spot. Voice in commercials is a specific craft: line readings are directed, pacing is shaped, and silences are load-bearing. AI tools (ElevenLabs, Play.ht, HeyGen) are good enough for spec work and decent for hero work when directed well. Real human VO still wins for tier-1 work, but the gap has narrowed fast, and a filmmaker who learns to direct an AI voice the way a VO director directs an actor can ship top-10% audio with zero session costs.

Demo

Direction techniques that transfer to AI VO:
(1) Give the system a ROLE, not a voice: 'a tired parent who has found the answer,' not 'a warm male voice.'
(2) Write punctuation like a screenplay: periods are breaths, commas are half-breaths, ellipses are decisions.
(3) Generate 5 takes of each line and pick the best, same as a real VO session.
(4) Treat pacing as separate: render at 1.0x speed, then slow to 0.9x in post if it reads hurried.
(5) Re-generate any line that 'reads AI' until it doesn't.
The voice is not the bottleneck; the direction is.
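
Technique (4) is simple arithmetic: playing a take back at 0.9x stretches its duration by a factor of 1/0.9. The sketch below is only an illustration. The helper name `stretched_duration` and the 27-second take are assumptions, not figures from a real session.

```python
def stretched_duration(seconds: float, speed: float) -> float:
    """Return the duration of a take after playback at `speed` (1.0 = unchanged)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return seconds / speed


# Illustrative numbers only: a 27-second read slowed to 0.9x.
take_seconds = 27.0
slowed = stretched_duration(take_seconds, 0.9)
print(f"{take_seconds:.1f}s at 0.9x -> {slowed:.1f}s")
```

With these numbers the slowed read runs about 30 seconds, so a script that already fills the slot at 1.0x leaves no room to slow down in post.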

# Line-by-line VO direction (ElevenLabs / Play.ht / Murf)
 
Line 1: "Mornings are a battle."
  Character:  tired parent
  Emotion:    resigned, dry, not complaining
  Pace:       medium, slight pause before "battle"
  Takes:      5 — pick the one that is the most tired, not the most dramatic
 
Line 2: "Until we found this."
  Character:  same
  Emotion:    relief, quiet smile
  Pace:       slower, emphasis on "this"
  Takes:      5 — pick the one where the smile is almost audible
 
Line 3: "Breakfast in two minutes. Done."
  Character:  same
  Emotion:    matter-of-fact, unforced
  Pace:       slightly clipped, confident
  Takes:      5 — pick the one that doesn't oversell
 
VO QA checklist (kill any take that fails):
  [ ] Does it sound like AI? If yes → regenerate
  [ ] Are breaths in the right places? If no → re-edit or regen
  [ ] Does the last word land? If it lifts up → regen with "declarative tone"
  [ ] Does the pacing match picture? If no → time-stretch carefully or regen
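
If you keep the direction sheet in a script rather than a text file, one possible shape is sketched below. The class name `LineDirection`, the `QA_CHECKS` keys, and `passes_qa` are hypothetical names chosen for illustration; they are not part of ElevenLabs, Play.ht, or Murf. The review itself is still done by ear, and the code only records the verdict.

```python
from dataclasses import dataclass


@dataclass
class LineDirection:
    """One row of the direction sheet: what to say and how to read it."""
    text: str
    character: str
    emotion: str
    pace: str
    pick: str        # what to listen for when choosing among takes
    takes: int = 5   # the sheet asks for 5 takes per line


# The four checklist questions, renamed so that True always means "pass".
QA_CHECKS = (
    "sounds_human",
    "breaths_in_right_places",
    "last_word_lands",
    "pacing_matches_picture",
)


def passes_qa(review: dict) -> bool:
    """A take survives only if every checklist item is explicitly True."""
    return all(review.get(check, False) is True for check in QA_CHECKS)


sheet = [
    LineDirection("Mornings are a battle.", "tired parent",
                  "resigned, dry, not complaining",
                  'medium, slight pause before "battle"',
                  pick="most tired, not most dramatic"),
    LineDirection("Until we found this.", "tired parent",
                  "relief, quiet smile",
                  'slower, emphasis on "this"',
                  pick="smile is almost audible"),
    LineDirection("Breakfast in two minutes. Done.", "tired parent",
                  "matter-of-fact, unforced",
                  "slightly clipped, confident",
                  pick="doesn't oversell"),
]

# A hand-entered review of one take: the last word lifted, so the take is killed.
review = {
    "sounds_human": True,
    "breaths_in_right_places": True,
    "last_word_lands": False,
    "pacing_matches_picture": True,
}
print(passes_qa(review))  # False
```

Treating a missing checklist answer as a failure is a deliberate choice here: a take that was never fully reviewed should not slip into the cut.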

Try it yourself

Pick an ad you love; transcribe the VO. Count words per second. Note where the reader pauses and breathes. Your AI-directed VO should mirror these densities.
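
The words-per-second count can be sketched in a few lines of Python. The function name `words_per_second` is a hypothetical helper, the transcript reuses the demo script from this lesson, and the 10-second duration is an assumed figure for illustration, not a measured one.

```python
import re


def words_per_second(transcript: str, duration_seconds: float) -> float:
    """Count words in a transcript and divide by the measured VO duration."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    # Treat runs of letters, digits, and apostrophes as words.
    words = re.findall(r"[A-Za-z0-9']+", transcript)
    return len(words) / duration_seconds


transcript = (
    "Mornings are a battle. Until we found this. "
    "Breakfast in two minutes. Done."
)
# Assumed duration for illustration; time the real ad with a stopwatch or editor.
density = words_per_second(transcript, 10.0)
print(f"{density:.2f} words per second")
```

Run the same function on your reference ad and on your AI-directed cut; if the two densities differ noticeably, the script is too long or the read is too fast.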

Write a 30-second script with 4–5 lines. Generate each in 5 takes across 3 voices. Cut the best take of each line together. You've just done the work of a VO session for $0.
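
Before generating anything, it helps to see how many renders this exercise implies. The sketch below enumerates every line, voice, and take combination. The line IDs, voice names, and file-naming pattern are placeholders invented for illustration, not names from any tool.

```python
from itertools import product


def take_plan(line_ids, voices, takes_per_line):
    """List one output filename per (line, voice, take) combination."""
    return [
        f"{line}_{voice}_take{take}.wav"
        for line, voice, take in product(
            line_ids, voices, range(1, takes_per_line + 1)
        )
    ]


line_ids = ["line1", "line2", "line3", "line4"]   # a 4-line script
voices = ["voice_a", "voice_b", "voice_c"]        # hypothetical placeholder names
plan = take_plan(line_ids, voices, takes_per_line=5)

print(len(plan))  # 60
print(plan[0])    # line1_voice_a_take1.wav
```

Four lines across three voices at five takes each is 60 renders to audition, so a consistent naming scheme pays for itself when you cut the best takes together.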

Direct a 'troubled-sincere' line (a hard emotional register for AI) and iterate until the take is believable. Some registers are currently out of reach; learn which ones.

Take the same script and record it twice: once with an AI-generated voice and once with a real VO artist (a Fiverr session costs $50–150). Compare the two bluntly. This calibrates when to pay for real VO.

Build a shortlist of 3 AI voices you trust and use repeatedly. A consistent voice becomes part of your brand's sonic identity.

Prompt your AI

Use these three prompts in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain why VO direction — not the voice model — is the limiting factor in AI-generated narration, and give me the 4–5 direction techniques that transfer from a real VO session.

2. Why it works (the mechanism)

Walk me through how modern neural-TTS models handle emotion and pacing: what is controlled at the prompt level, what is controlled through the inference settings (stability, style, similarity), and why certain emotional registers (sincerity, menace) are harder to render than others.

3. Advanced — application & what's next

I'm producing a 10-spot campaign with a consistent narrator voice across every ad. Walk me through the pipeline: voice selection, direction style-guide, per-line take strategy, QA rubric, and when to escalate a line to a real VO artist.