What if your AI assistant had a face, a voice, and a personality — and none of it touched the cloud?
That’s Aillex. You talk into a web page; a few seconds later she answers out loud, lip-synced, as an animated 3D character — and every single component runs on one PC. No API keys to a cloud LLM, no subscription, no audio leaving your machine.
This guide is the map of the whole system. Each stage has (or will have) its own hands-on guide.
The loop
🎙 your voice (browser mic)
→ Speech-to-text (faster-whisper, local GPU)
→ Brain (26B LLM via Ollama, local GPU + persistent memory)
→ Voice (NeuTTS voice clone — runs on CPU!)
→ Face (lip-sync / animated 3D avatar)
→ 🖥 talking character in your browser
Five stages, five open tools. The magic isn’t any single model — it’s the plumbing that keeps them warm, orchestrated, and co-resident on one GPU.
What you need
- A modern NVIDIA GPU. We build on an RTX 5090 (32 GB), but the architecture scales down: the biggest VRAM cost is the LLM brain, and that’s a dial (a 7B–14B model runs on far less).
- Windows 11 + WSL2 or Linux. Our stack straddles both — inference services in WSL, orchestration on Windows.
- Patience for dependency hell. We’ve documented every trap we hit so you don’t have to hit them.
The five stages
1. Ears — faster-whisper
Local speech-to-text is a solved problem: faster-whisper transcribes 20 seconds of speech in ~0.16s on modern GPUs, handles kids’ voices and accents, and needs ~1.5 GB VRAM.
2. Brain — a local LLM with memory
We run a 26B multimodal model via Ollama with an agent layer on top for persistent memory and tool use. Keeping the model resident (keep_alive=-1) and the session warm is the difference between 8-second and sub-1-second brain latency. Bonus: a multimodal brain means your assistant can also see your screen at zero extra VRAM. → guide: Run a 26B AI Brain Locally
3. Voice — a cloned voice on the CPU
The unlock most people miss: modern small TTS models mean your assistant’s custom voice costs zero VRAM. We designed Aillex’s voice once, then cloned it locally with NeuTTS Air — it runs entirely on CPU, leaving the GPU for the brain. → guide: Clone a Voice Locally
4. Face — from video loops to a real 3D character
We built this twice, and both paths are valid:
- 2D path (fastest): pre-rendered character video loops + MuseTalk real-time mouth inpainting. Runs faster than real time on a 5090. → guide: Real-Time Lip-Sync with MuseTalk
- 3D path (the good one): a rigged, animated 3D version of your character rendered in the browser with three.js — outfit switching included. → guide: Turn One AI Image into a Rigged 3D Character
5. Stage — a web page, like a video call
The front-end is deliberately boring: one web page with a mic button and a WebSocket. The character idles, thinks, and answers — framed like a video call. Any browser on your network (or phone, via Tailscale) can join.
The honest numbers
- Full loop latency: ~20–25s per turn in our batch pipeline (audio-quality-first), or ~4s first-audio in the voice-optimized path with sentence streaming. Real-time (<1s) is possible with streaming everything — that’s the frontier we’re pushing.
- VRAM budget on 32 GB: brain ~17.6 GB + whisper ~1.5 GB + lip-sync ~8.5 GB coexist fine — because the voice is on CPU.
- Cost: $0/month. That’s the point.
Why local matters
Every cloud companion app can change its pricing, its personality, or its privacy policy tomorrow. A local companion is yours: your data stays home, the personality is yours to define, and it works when the internet doesn’t.
Watch Aillex herself demo all of this on YouTube → @AskAillex, where we publish builds, fails, and upgrades as they happen.