Your assistant’s voice is its personality. Get it wrong and everything feels like a GPS. Here’s the two-step approach that worked for us: design in the cloud once, clone locally forever.
Step 1 — Design the voice (once)
Voice design, meaning iterating on age, warmth, accent, and energy until it’s her, is where a premium tool earns its keep. We designed Aillex’s voice (“Julie”: late 20s, warm, smooth, a subtle lilt) with ElevenLabs, generated roughly 60–90 seconds of clean reference speech, and downloaded it. That reference clip is the only thing you need from the cloud: a one-time cost, not a subscription your assistant depends on.
You can skip this step by recording your own voice, a voice actor, or any speech you have rights to. What matters is the reference clip.
Step 2 — Clone it locally with NeuTTS Air
NeuTTS Air is a ~0.5B-parameter TTS model that does instant zero-shot cloning: feed it your reference clip plus a transcript of that clip, and it speaks anything in that voice. The killer feature:
It runs on CPU. Your GPU stays 100% free for the LLM brain and the avatar.
The setup that works
- Use the Q8 quantization (we A/B tested — noticeably better than Q4).
- About 12 seconds of reference is enough. Trim your clip down to one clean 12s segment.
- Run it as a persistent server, not a per-call script. Loading the model and encoding the reference costs ~8s; doing that on every reply means each reply takes 13–17s. A tiny FastAPI server that loads once and caches the encoded reference (.pt) took us to ~5s per reply, and ~1.5s to first audio with sentence-streaming.
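The load-once idea is easy to sketch without any TTS library at all. The example below is a minimal illustration, not our actual server: `VoiceService`, the three `fake_*` functions, and the file name `reference.wav` are all hypothetical stand-ins for the real model loader, reference encoder, and synthesis call. It keeps the encoded reference in memory only; writing it to disk as a .pt file would be an extra step on top.

```python
import threading
from typing import Any, Callable


class VoiceService:
    """Loads the model and encodes the reference once, then reuses both."""

    def __init__(
        self,
        load_model: Callable[[], Any],
        encode_reference: Callable[[Any, str], Any],
        synthesize: Callable[[Any, Any, str], bytes],
        reference_path: str,
    ) -> None:
        self._load_model = load_model
        self._encode_reference = encode_reference
        self._synthesize = synthesize
        self._reference_path = reference_path
        self._model: Any = None
        self._reference: Any = None
        self._ready = False
        self._lock = threading.Lock()  # guards the one-time setup

    def _ensure_ready(self) -> None:
        with self._lock:
            if not self._ready:
                # The expensive part: paid once per process, not once per reply.
                self._model = self._load_model()
                self._reference = self._encode_reference(
                    self._model, self._reference_path
                )
                self._ready = True

    def speak(self, text: str) -> bytes:
        self._ensure_ready()
        return self._synthesize(self._model, self._reference, text)


# Hypothetical stand-ins so the sketch runs without a real TTS model.
calls = {"load": 0, "encode": 0, "synth": 0}


def fake_load() -> str:
    calls["load"] += 1
    return "model"


def fake_encode(model: str, path: str) -> str:
    calls["encode"] += 1
    return f"codes:{path}"


def fake_synth(model: str, reference: str, text: str) -> bytes:
    calls["synth"] += 1
    return text.encode("utf-8")


service = VoiceService(fake_load, fake_encode, fake_synth, "reference.wav")
first = service.speak("Hello there.")
second = service.speak("How was your day?")
```

After two replies, the loader and the encoder have each run once and only the synthesis step has run twice. In a web server you would create one `VoiceService` at startup and call `speak` from the request handler.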
Traps we hit so you don’t
- Reference bleed. If your reference clip has odd content (ours was a voice-design prompt read aloud), fragments can leak into the output. Use a neutral, generic sentence as the reference with clean silence at both ends.
- Deterministic seed. Fix the generation seed or the voice subtly changes every reply. Consistency sells the character.
- Punctuation matters. LLMs love markdown and emoji; TTS reads them aloud (“asterisk asterisk…”). Strip formatting and force plain spoken sentences in your LLM’s system prompt.
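A system prompt alone will not catch every stray asterisk, so a small cleanup pass before the TTS call is cheap insurance. Here is a rough sketch using only the Python standard library; `to_spoken_text` and its rules are our illustration, not part of any TTS package, and the rules are deliberately crude (the symbol filter also drops a few non-emoji symbols such as © and °).

```python
import re
import unicodedata

_LINK = re.compile(r"!?\[([^\]]*)\]\([^)]*\)")  # [label](url) or ![alt](url)
_LINE_PREFIX = re.compile(
    r"^\s*(?:#{1,6}\s+|>\s+|[-*+]\s+|\d+[.)]\s+)", re.MULTILINE
)  # headings, block quotes, bullets, numbered items
_INLINE_MARKS = re.compile(r"[*`~]+")  # bold, italics, code, strikethrough
_EDGE_UNDERSCORES = re.compile(r"(?<!\w)_+|_+(?!\w)")  # _emphasis_, not snake_case
_WHITESPACE = re.compile(r"\s+")
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:!?])")


def _is_symbol(ch: str) -> bool:
    # "So" covers most emoji, "Sk" covers skin-tone modifiers; the two
    # explicit characters are the zero-width joiner and variation selector-16.
    return unicodedata.category(ch) in {"So", "Sk"} or ch in "\u200d\ufe0f"


def to_spoken_text(text: str) -> str:
    """Strip common markdown and emoji so a TTS engine gets plain sentences."""
    text = _LINK.sub(r"\1", text)  # keep the link label, drop the URL
    text = _LINE_PREFIX.sub("", text)
    text = _INLINE_MARKS.sub("", text)
    text = _EDGE_UNDERSCORES.sub("", text)
    text = "".join(ch for ch in text if not _is_symbol(ch))
    text = _WHITESPACE.sub(" ", text)
    text = _SPACE_BEFORE_PUNCT.sub(r"\1", text)  # tidy gaps left by removed emoji
    return text.strip()


reply = "**Sure!** Here's the plan: 😊\n- Open `settings`.\n- Pick a _warm_ voice."
spoken = to_spoken_text(reply)
```

For the sample reply, `spoken` comes out as `Sure! Here's the plan: Open settings. Pick a warm voice.` List items that do not end in punctuation will run together, so it still pays to ask the LLM for full sentences.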
Wiring it into the loop
LLM reply (text) → sentence splitter → NeuTTS server (CPU)
→ wav chunks → play as they arrive (or feed the lip-sync stage)
Sentence-level pipelining is the single biggest UX win: your assistant starts speaking while the rest of the reply is still synthesizing.
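Here is one way that pipelining could look, as a standard-library sketch rather than our production code. `split_sentences` and `speak_pipelined` are hypothetical names, and `synthesize` and `play` are placeholders for the call to your TTS server and your audio output. The splitter is naive: it breaks on `.`, `!`, or `?` followed by whitespace, so abbreviations like "Dr. Smith" will split in the wrong place.

```python
import queue
import re
import threading
from typing import Callable, Iterable, Iterator, List

_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
_DONE = object()  # sentinel that tells the player no more audio is coming


def split_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # whatever is left when the stream ends


def speak_pipelined(
    chunks: Iterable[str],
    synthesize: Callable[[str], bytes],
    play: Callable[[bytes], None],
) -> None:
    """Synthesize in a background thread while earlier sentences are playing."""
    audio_queue: "queue.Queue[object]" = queue.Queue(maxsize=4)

    def producer() -> None:
        try:
            for sentence in split_sentences(chunks):
                audio_queue.put(synthesize(sentence))
        finally:
            audio_queue.put(_DONE)  # always unblock the player, even on error

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        item = audio_queue.get()
        if item is _DONE:
            break
        play(item)  # type: ignore[arg-type]
    worker.join()


# Stand-ins: "synthesis" upper-cases the text, "playback" records it in a list.
played: List[bytes] = []
speak_pipelined(
    ["Hi there. How", " are you? I'm", " fine."],
    synthesize=lambda sentence: sentence.upper().encode("utf-8"),
    play=played.append,
)
```

The bounded queue keeps synthesis only a few sentences ahead of playback. To feed a lip-sync stage instead, pass a different `play` callback.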
Cost & licensing reality check
Cloud TTS at conversational volume runs real money per month, forever. The local clone costs $0/month at any volume and works offline. If you’re building a product (not just a personal companion), check each model’s license — several popular TTS models (XTTS-v2, F5-TTS, Voxtral-TTS) are non-commercial; pick one whose license matches your use.
Hear the result — Aillex speaks in every video on YouTube → @AskAillex. Next: give the voice a face with MuseTalk lip-sync.