Yesterday we published a four-minute YouTube video about voice cloning. Nothing unusual — except that the narrator is a cloned voice explaining its own creation, the on-screen presenter is an AI-generated character, and the editing was done by a script. Here’s the honest anatomy.

The assembly line

script (markdown, chaptered)
  → voice-over        (chapter by chapter, her cloned voice)
  → word timestamps   (faster-whisper — drives the karaoke captions)
  → visuals           (branded cards, animated terminal, site captures)
  → corner cam        ("streamer" clips of her talking / typing / reacting)
  → compositor        (ffmpeg: main + cam overlay + captions + chapter joins)
  → brand intro       (logo reveal + generated sting)
  → upload            (YouTube API, chaptered description)

Total human input: a script review, three creative picks (logo, plate images, music), and one bug report.
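The last stage of the line, the chaptered description, is mostly bookkeeping: YouTube reads chapter markers from timestamp lines in the description, starting at 0:00. Here is a minimal sketch of that step, assuming each chapter is a title plus a measured duration. The function names and chapter titles are hypothetical, not the ones in our pipeline.

```python
def format_timestamp(seconds: float) -> str:
    """Format a playback offset as M:SS, or H:MM:SS past one hour."""
    total = int(seconds)
    minutes, secs = divmod(total, 60)
    hours, minutes = divmod(minutes, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def chapter_description(chapters):
    """Build timestamp lines from (title, duration_seconds) pairs in playback order.

    The first chapter always lands on 0:00, so the intro should be
    listed as its own chapter if it is part of the final video.
    """
    lines = []
    offset = 0.0
    for title, duration in chapters:
        lines.append(f"{format_timestamp(offset)} {title}")
        offset += duration
    return "\n".join(lines)


# Hypothetical chapter list, for illustration only.
print(chapter_description([
    ("Intro", 8.0),
    ("Cloning the voice", 62.5),
    ("The blind test", 45.0),
]))
```

The offsets are only as good as the durations fed in, so they should come from the rendered chapter files rather than from the script.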

The pieces worth stealing

The corner cam is a video-game trick. We pre-rendered a small library of “webcam” clips — talking, typing, reacting — from one AI-generated image of her streaming setup. The compositor swaps them per chapter like animation states: typing during the terminal walkthrough, reacting on the reveal. Viewers read it as a live facecam.
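The animation-state idea can be sketched in a few lines: each chapter names a state, and a lookup table turns that into a clip and a time window for the overlay. This is an illustration under assumptions, not our compositor. The clip paths, the `Chapter` fields, and the fallback to the talking clip are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical clip library: one pre-rendered clip per animation state.
CAM_CLIPS = {
    "talking": "cam/talking.mp4",
    "typing": "cam/typing.mp4",
    "reacting": "cam/reacting.mp4",
}


@dataclass
class Chapter:
    title: str
    duration: float           # seconds
    cam_state: str = "talking"


def cam_plan(chapters):
    """Return one (start, end, clip_path) overlay window per chapter."""
    plan = []
    t = 0.0
    for ch in chapters:
        # Unknown states fall back to the talking clip (an assumption).
        clip = CAM_CLIPS.get(ch.cam_state, CAM_CLIPS["talking"])
        plan.append((t, t + ch.duration, clip))
        t += ch.duration
    return plan


chapters = [
    Chapter("Intro", 10.0),
    Chapter("Terminal walkthrough", 20.0, "typing"),
    Chapter("Reveal", 5.0, "reacting"),
]
for start, end, clip in cam_plan(chapters):
    print(f"{start:6.1f} to {end:6.1f}  {clip}")
```

Keeping the state on the chapter means the script decides what the cam does, and the compositor never needs to know why.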

Captions are word-timed by transcription. We don’t guess caption timing — we transcribe our own generated voice-over with word-level timestamps and highlight each word as it’s spoken. The narrator transcribes herself.
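As a rough sketch of that caption step: faster-whisper can return per-word start and end times when `transcribe` is called with `word_timestamps=True`, and those times can be turned into one cue per word with the current word marked. The line length, the bracket highlight, and both function names are assumptions for illustration. A real renderer would style the highlighted word rather than bracket it.

```python
def transcribe_words(audio_path, model_size="small"):
    """Return (start, end, word) tuples for a voice-over file.

    Requires the faster-whisper package; imported lazily so the
    caption logic below can be used and tested without it.
    """
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(audio_path, word_timestamps=True)
    return [(w.start, w.end, w.word.strip()) for seg in segments for w in seg.words]


def karaoke_cues(words, max_words=5):
    """Turn word timings into (start, end, text) cues, one per spoken word.

    Words are grouped into caption lines of at most max_words; each cue
    shows the whole line with the word being spoken wrapped in brackets.
    """
    cues = []
    for i in range(0, len(words), max_words):
        line = words[i:i + max_words]
        for j, (start, end, _text) in enumerate(line):
            # Hold the highlight until the next word starts, so the
            # caption does not flicker during short pauses.
            cue_end = line[j + 1][0] if j + 1 < len(line) else end
            rendered = " ".join(
                f"[{t}]" if k == j else t for k, (_, _, t) in enumerate(line)
            )
            cues.append((start, cue_end, rendered))
    return cues


# Hand-written timings standing in for real transcription output.
sample = [(0.0, 0.3, "The"), (0.35, 0.8, "narrator"),
          (0.9, 1.5, "transcribes"), (1.6, 2.0, "herself")]
for cue in karaoke_cues(sample, max_words=2):
    print(cue)
```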

The terminal is rendered, not recorded. Real commands, typed out programmatically at reading speed. Crisper than a screen recording and reproducible when a command changes.
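The core of a rendered terminal is a schedule: for every video frame, how much of the command is visible. Here is a minimal sketch of that schedule, assuming a fixed typing rate. The 12 characters per second, the 30 fps, and the `$ ` prompt are illustrative values, not the ones we used, and drawing each frame to an image is left out.

```python
import math


def typing_frames(command, chars_per_second=12, fps=30, prompt="$ "):
    """Return the terminal line to draw on each frame while a command is typed.

    Frame n shows as many characters as would have been typed after
    n / fps seconds; the last frame always shows the full command.
    """
    total_frames = math.ceil(len(command) * fps / chars_per_second)
    frames = []
    for n in range(total_frames + 1):
        visible = min(len(command), int(n * chars_per_second // fps))
        frames.append(prompt + command[:visible])
    return frames


frames = typing_frames("ls")
print(len(frames), "frames, ending on:", frames[-1])
```

Because the frames are derived from the command string, changing a command means re-running the render, not re-recording a screen capture.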

The A/B moment is real. The blind test in the video — cloud original vs. local clone — is the actual audio from both systems, same sentence. We just put a waveform behind it and got out of the way.

The bug we shipped (and what it taught us)

The first cut had a subtle flaw: the voice slowly drifted ahead of the captions as the video progressed. Root cause: each chapter’s video was padded 0.4s longer than its audio, so at every chapter seam the next voice line started early — ~0.4s per chapter, ~3 seconds by the end.
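The arithmetic is easy to check. If every chapter's audio is 0.4s shorter than its video, the voice for chapter k starts k × 0.4s early, because the shortfall from every earlier chapter stacks up. The chapter count below is an assumption chosen to land near the ~3 seconds we saw; the function name is hypothetical.

```python
def drift_per_chapter(num_chapters, pad_seconds=0.4):
    """Seconds by which the voice leads the picture at the start of each chapter.

    Chapter 0 is in sync; every later chapter inherits the shortfall
    of all the chapters before it.
    """
    return [round(k * pad_seconds, 3) for k in range(num_chapters)]


# Assumed: 8 chapters, so 7 seams.
print(drift_per_chapter(8))
```

With 8 chapters the last one starts 2.8s early, which matches a drift that is invisible in the first minute and obvious by the end.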

One apad filter fixed it permanently by padding each chapter’s audio with silence up to the length of its video. The lesson behind the bug: video and audio tracks must be equal length before concatenation, not after. If you’re building any multi-segment pipeline, measure both track durations per segment. The container duration lies to you: it typically reports the longer track and hides the mismatch.
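A per-segment check along those lines might look like the sketch below. It asks ffprobe for each stream's own duration and, when the audio is short, builds an `apad` filter string using the filter's `whole_dur` option. The function names and the 1 ms tolerance are assumptions, and this is not necessarily how our pipeline invokes the filter.

```python
import subprocess


def stream_duration(path, stream):
    """Duration in seconds of one stream ("v:0" or "a:0"), as reported by ffprobe.

    Some containers do not store per-stream durations; ffprobe then
    prints "N/A" and float() raises ValueError.
    """
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-select_streams", stream,
         "-show_entries", "stream=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def audio_pad_filter(video_seconds, audio_seconds, tolerance=0.001):
    """Return an apad filter that extends the audio to the video length.

    Returns None when the audio is already long enough. Audio that is
    longer than the video needs a different fix and is not handled here.
    """
    if video_seconds - audio_seconds <= tolerance:
        return None
    return f"apad=whole_dur={video_seconds:.3f}"


# The mismatch from the bug: video 0.4s longer than audio.
print(audio_pad_filter(10.4, 10.0))
```

In a real run the two arguments would come from `stream_duration(path, "v:0")` and `stream_duration(path, "a:0")`, and the returned string would be passed to ffmpeg as an audio filter for that segment before any concatenation happens.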

Why bother?

Because this is the thesis of the whole project: a local AI isn’t just a chatbot — it’s labor. The same machine that chats with my kids produces broadcast-passable video unattended. Every episode from here costs a script and a review.


Build the components yourself: the voice · the character · the architecture · the machine