Yesterday we published a four-minute YouTube video about voice cloning. Nothing unusual — except that the narrator is a cloned voice explaining its own creation, the on-screen presenter is an AI-generated character, and the editing was done by a script. Here’s the honest anatomy.
The assembly line
script (markdown, chaptered)
→ voice-over (chapter by chapter, her cloned voice)
→ word timestamps (faster-whisper — drives the karaoke captions)
→ visuals (branded cards, animated terminal, site captures)
→ corner cam ("streamer" clips of her talking / typing / reacting)
→ compositor (ffmpeg: main + cam overlay + captions + chapter joins)
→ brand intro (logo reveal + generated sting)
→ upload (YouTube API, chaptered description)
Total human input: a script review, three creative picks (logo, plate images, music), and one bug report.
The pieces worth stealing
The corner cam is a video-game trick. We pre-rendered a small library of “webcam” clips — talking, typing, reacting — from one AI-generated image of her streaming setup. The compositor swaps them per chapter like animation states: typing during the terminal walkthrough, reacting on the reveal. Viewers read it as a live facecam.
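A minimal sketch of the per-chapter swap, assuming a hypothetical clip library, chapter record, and file layout (none of these names come from our actual compositor). It builds an ffmpeg argument list that loops a short cam clip and pins it to the bottom-right corner of the chapter's main video.

```python
from dataclasses import dataclass

# Hypothetical clip library: one pre-rendered loop per "animation state".
CAM_CLIPS = {
    "talking": "cam/talking.mp4",
    "typing": "cam/typing.mp4",
    "reacting": "cam/reacting.mp4",
}


@dataclass
class Chapter:
    name: str
    main_video: str  # full-frame visuals for this chapter
    cam_state: str   # key into CAM_CLIPS


def overlay_command(chapter, out_path, cam_width=320, margin=24):
    """Build ffmpeg arguments that pin the chapter's cam clip bottom-right."""
    # Unknown states fall back to the neutral "talking" loop.
    cam = CAM_CLIPS.get(chapter.cam_state, CAM_CLIPS["talking"])
    filter_graph = (
        f"[1:v]scale={cam_width}:-2[cam];"
        f"[0:v][cam]overlay=W-w-{margin}:H-h-{margin}:shortest=1[v]"
    )
    return [
        "ffmpeg", "-y",
        "-i", chapter.main_video,
        "-stream_loop", "-1", "-i", cam,  # loop the short cam clip indefinitely
        "-filter_complex", filter_graph,
        "-map", "[v]", "-an",             # video only; audio is handled separately
        out_path,
    ]
```

Because the cam input loops forever, `shortest=1` on the overlay is what ends the output when the main video ends.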
Captions are word-timed by transcription. We don’t guess caption timing — we transcribe our own generated voice-over with word-level timestamps and highlight each word as it’s spoken. The narrator transcribes herself.
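Roughly how the word timing can drive karaoke captions, as a sketch rather than our exact code. `transcribe_words` uses faster-whisper's `word_timestamps=True` option; the `Word` tuple, the bracket highlight, and the five-words-per-line grouping are illustrative assumptions.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    start: float
    end: float


def transcribe_words(audio_path, model_size="small"):
    """Word timestamps from faster-whisper (requires the package and a model)."""
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(audio_path, word_timestamps=True)
    return [
        Word(w.word.strip(), w.start, w.end)
        for seg in segments
        for w in seg.words
    ]


def karaoke_events(words, words_per_line=5):
    """Return (start, end, line) tuples with the spoken word in [brackets]."""
    events = []
    for i in range(0, len(words), words_per_line):
        line = words[i:i + words_per_line]
        for j, current in enumerate(line):
            text = " ".join(
                f"[{w.text}]" if k == j else w.text
                for k, w in enumerate(line)
            )
            # Hold the highlight until the next word starts so it never blinks off.
            end = line[j + 1].start if j + 1 < len(line) else current.end
            events.append((current.start, end, text))
    return events
```

A real renderer would swap the brackets for a colour change in whatever subtitle or drawing format it uses; the timing logic stays the same.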
The terminal is rendered, not recorded. Real commands, typed out programmatically at reading speed. Crisper than a screen recording and reproducible when a command changes.
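The idea behind "typed out programmatically" can be sketched as a keyframe schedule: one keyframe per character, and a lookup that tells the renderer what text is visible at any timestamp. The typing rate, the prompt string, and the function names here are assumptions for illustration.

```python
import bisect


def typing_schedule(command, chars_per_second=12.0, start=0.0, prompt="$ "):
    """Return (time, visible_text) keyframes, one per typed character."""
    step = 1.0 / chars_per_second
    frames = [(start, prompt)]
    for i in range(1, len(command) + 1):
        frames.append((start + i * step, prompt + command[:i]))
    return frames


def text_at(frames, t):
    """Text visible at time t: the latest keyframe at or before t."""
    times = [time for time, _ in frames]
    idx = bisect.bisect_right(times, t) - 1
    return frames[max(idx, 0)][1]
```

Since the schedule is derived from the command string, changing a command means re-running the render, not re-recording a screen capture.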
The A/B moment is real. The blind test in the video — cloud original vs. local clone — is the actual audio from both systems, same sentence. We just put a waveform behind it and got out of the way.
The bug we shipped (and what it taught us)
The first cut had a subtle flaw: the voice slowly drifted ahead of the captions as the video progressed. Root cause: each chapter’s video was padded 0.4s longer than its audio, so at every chapter seam the next voice line started early — ~0.4s per chapter, ~3 seconds by the end.
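The arithmetic is easy to reproduce. This sketch assumes the audio and video tracks were each joined end to end on their own timeline, which is our reading of the bug, and it uses made-up chapter lengths (eight chapters of 30 seconds) purely for illustration.

```python
from itertools import accumulate


def seam_drift(audio_durations, video_pad=0.4):
    """Seconds by which each chapter's audio leads its video after a naive join."""
    video_durations = [a + video_pad for a in audio_durations]
    # Start time of each chapter on its own track: running total of what came before.
    audio_starts = [0.0, *accumulate(audio_durations)][:-1]
    video_starts = [0.0, *accumulate(video_durations)][:-1]
    return [round(v - a, 3) for a, v in zip(audio_starts, video_starts)]


# Illustrative only: eight 30-second chapters.
print(seam_drift([30.0] * 8))
```

Each seam adds another `video_pad` of lead, so the error grows linearly with the chapter count and never corrects itself.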
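One apad filter fixed it permanently: it pads each chapter’s audio with trailing silence so the audio runs as long as the video. The lesson behind the bug: a segment’s video and audio tracks must be equal in length before concatenation, not after. If you’re building any multi-segment pipeline, measure both track durations per segment. The container duration lies to you, because it is a single number and cannot show that the two tracks disagree.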
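One way the per-segment check and the fix could look, as a hedged sketch and not the exact command from our pipeline. It reads per-stream durations with ffprobe's JSON output and, when the audio is short, builds an ffmpeg call using apad's `whole_dur` option. The function names and the 20 ms tolerance are assumptions, and some containers do not report per-stream durations at all, in which case the key will simply be missing.

```python
import json
import subprocess


def parse_durations(ffprobe_json):
    """Map codec_type -> duration in seconds from ffprobe's JSON output."""
    streams = json.loads(ffprobe_json)["streams"]
    return {
        s["codec_type"]: float(s["duration"])
        for s in streams
        if "duration" in s
    }


def stream_durations(path):
    """Run ffprobe and return per-stream durations, e.g. {'video': .., 'audio': ..}."""
    out = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "stream=codec_type,duration",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_durations(out)


def pad_audio_command(src, dst, durations, tolerance=0.02):
    """ffmpeg args that pad audio with silence up to the video length.

    Returns None when the audio already covers the video (within tolerance).
    """
    gap = durations["video"] - durations["audio"]
    if gap <= tolerance:
        return None
    return [
        "ffmpeg", "-y", "-i", src,
        "-c:v", "copy",  # leave the video untouched
        "-af", f"apad=whole_dur={durations['video']:.3f}",
        dst,
    ]
```

Running this on every chapter before the join turns the drift from something you hear in the final cut into something the build reports per segment.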
Why bother?
Because this is the thesis of the whole project: a local AI isn’t just a chatbot — it’s labor. The same machine that chats with my kids produces broadcast-passable video unattended. Every episode from here costs a script and a review.
Build the components yourself: the voice · the character · the architecture · the machine