Your assistant’s voice is its personality. Get it wrong and everything feels like a GPS. Here’s the two-step approach that worked for us: design in the cloud once, clone locally forever.

Step 1 — Design the voice (once)

Voice design (iterating on age, warmth, accent, and energy until it’s her) is where a premium tool earns its keep. We designed Aillex’s voice (“Julie”: late-20s, warm, smooth, a subtle lilt) with ElevenLabs, generated ~60–90 seconds of clean reference speech, and downloaded it. That reference clip is the only thing you need from the cloud: a one-time cost, not a subscription your assistant depends on.

You can skip this step by recording your own voice, a voice actor, or any speech you have rights to. What matters is the reference clip.

Step 2 — Clone it locally with NeuTTS Air

NeuTTS Air is a ~0.5B-parameter TTS model that does instant zero-shot cloning — feed it your reference clip + a transcript, and it speaks anything in that voice. The killer feature:

It runs on CPU. Your GPU stays 100% free for the LLM brain and the avatar.

The setup that works

  • Use the Q8 quantization (we A/B tested — noticeably better than Q4).
  • 12 seconds of reference is enough. Trim your reference to a clean 12s segment.
  • Run it as a persistent server, not a per-call script. Loading the model + encoding the reference costs ~8s; paying that on every reply pushes each reply to 13–17s. A tiny FastAPI server that loads once and caches the encoded reference (.pt) took us to ~5s per reply, and ~1.5s to first audio with sentence-streaming.
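
The load-once-and-cache idea in the last bullet can be sketched without any TTS library. This is a rough illustration, not our actual server: `encode` is a placeholder for whatever call your TTS engine uses to encode a reference clip, `get_reference_codes` and the cache file layout are hypothetical names, and `pickle` stands in for the `.pt` file so the example has no dependencies.

```python
import hashlib
import pickle
from pathlib import Path
from typing import Any, Callable


def get_reference_codes(
    ref_path: Path,
    encode: Callable[[Path], Any],
    cache_dir: Path,
) -> Any:
    """Return the encoded reference, computing it only on a cache miss.

    `encode` is a placeholder for your TTS engine's reference encoder.
    """
    # Key the cache on the clip's bytes, so a changed clip gets re-encoded.
    digest = hashlib.sha256(ref_path.read_bytes()).hexdigest()[:16]
    cache_file = cache_dir / f"{ref_path.stem}-{digest}.pkl"

    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)

    codes = encode(ref_path)  # the slow step, paid once per clip
    cache_dir.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(codes, fh)
    return codes
```

Call something like this once at server startup and keep the result in memory next to the loaded model; each request then only pays for synthesis.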

Traps we hit so you don’t

  1. Reference bleed. If your reference clip has odd content (ours was a voice-design prompt read aloud), fragments can leak into the output. Use a neutral, generic sentence as the reference with clean silence at both ends.
  2. Deterministic seed. Fix the generation seed or the voice subtly changes every reply. Consistency sells the character.
  3. Punctuation matters. LLMs love markdown and emoji; TTS reads them aloud (“asterisk asterisk…”). Strip formatting and force plain spoken sentences in your LLM’s system prompt.
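
For trap 3, a system prompt alone is rarely airtight, so it can help to clean the text right before it reaches the TTS. Below is a minimal sketch of such a filter; `to_spoken_text` is a hypothetical helper, the regexes cover only the most common markdown, and the emoji ranges are a rough approximation rather than a complete list.

```python
import re

# Emoji and pictograph ranges; a rough set, not exhaustive.
_EMOJI = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")


def to_spoken_text(text: str) -> str:
    """Strip common markdown and emoji so the TTS reads only the words."""
    # Drop fenced code blocks entirely; nobody wants code read aloud.
    text = re.sub(r"`{3}.*?`{3}", " ", text, flags=re.DOTALL)
    # Keep the label of links and images, drop the URL.
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Remove heading markers and list bullets at the start of a line.
    text = re.sub(r"^\s{0,3}(#{1,6}|[-*+]|\d+\.)\s+", "", text, flags=re.MULTILINE)
    # Remove emphasis and inline-code markers (this also joins snake_case words).
    text = re.sub(r"[*_`~]+", "", text)
    text = _EMOJI.sub("", text)
    # Collapse whitespace, then tidy any space left before punctuation.
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([,.;:!?])", r"\1", text)
```

Run it on each LLM reply before sentence splitting, so the splitter and the voice both see plain sentences.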

Wiring it into the loop

LLM reply (text) → sentence splitter → NeuTTS server (CPU)
   → wav chunks → play as they arrive (or feed the lip-sync stage)

Sentence-level pipelining is the single biggest UX win: your assistant starts speaking while the rest of the reply is still synthesizing.
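
Here is a minimal sketch of that pipelining, using only the standard library. The names are hypothetical: `synthesize` stands in for a call to your TTS server that returns audio bytes for one sentence, and `play` for whatever plays a chunk or hands it to the lip-sync stage. The splitter is deliberately naive and will trip on abbreviations like "e.g.".

```python
import queue
import re
import threading
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def stream_speech(
    text: str,
    synthesize: Callable[[str], bytes],
    play: Callable[[bytes], None],
) -> None:
    """Play earlier sentences while later ones are still being synthesized."""
    # A small buffer keeps synthesis from running far ahead of playback.
    chunks: queue.Queue = queue.Queue(maxsize=2)

    def producer() -> None:
        try:
            for sentence in split_sentences(text):
                chunks.put(synthesize(sentence))
        finally:
            chunks.put(None)  # sentinel: no more audio is coming

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    # Consume in order; blocks until the next chunk is ready.
    while (chunk := chunks.get()) is not None:
        play(chunk)
    worker.join()
```

The time to first audio is then roughly the synthesis time of the first sentence, not of the whole reply, which is why short opening sentences feel snappier.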

Cost & licensing reality check

Cloud TTS at conversational volume runs real money per month, forever. The local clone costs $0/month at any volume and works offline. If you’re building a product (not just a personal companion), check each model’s license — several popular TTS models (XTTS-v2, F5-TTS, Voxtral-TTS) are non-commercial; pick one whose license matches your use.


Hear the result — Aillex speaks in every video on YouTube → @AskAillex. Next: give the voice a face with MuseTalk lip-sync.