What if your AI assistant had a face, a voice, and a personality — and none of it touched the cloud?

That’s Aillex. You talk into a web page; a few seconds later she answers out loud, lip-synced, as an animated 3D character — and every single component runs on one PC. No API keys to a cloud LLM, no subscription, no audio leaving your machine.

This guide is the map of the whole system. Each stage has (or will have) its own hands-on guide.

The loop

🎙 your voice (browser mic)
   → Speech-to-text        (faster-whisper, local GPU)
   → Brain                 (26B LLM via Ollama, local GPU + persistent memory)
   → Voice                 (NeuTTS voice clone — runs on CPU!)
   → Face                  (lip-sync / animated 3D avatar)
   → 🖥 talking character in your browser

Five stages, five open tools. The magic isn’t any single model — it’s the plumbing that keeps them warm, orchestrated, and co-resident on one GPU.

What you need

  • A modern NVIDIA GPU. We build on an RTX 5090 (32 GB), but the architecture scales down: the biggest VRAM cost is the LLM brain, and that’s a dial (a 7B–14B model runs on far less).
  • Windows 11 + WSL2 or Linux. Our stack straddles both — inference services in WSL, orchestration on Windows.
  • Patience for dependency hell. We’ve documented every trap we hit so you don’t have to hit them.

The five stages

1. Ears — faster-whisper

Local speech-to-text is a solved problem: faster-whisper transcribes 20 seconds of speech in ~0.16s on modern GPUs, handles kids’ voices and accents, and needs ~1.5 GB VRAM.

2. Brain — a local LLM with memory

We run a 26B multimodal model via Ollama with an agent layer on top for persistent memory and tool use. Keeping the model resident (keep_alive=-1) and the session warm is the difference between 8-second and sub-1-second brain latency. Bonus: a multimodal brain means your assistant can also see your screen at zero extra VRAM. → guide: Run a 26B AI Brain Locally

3. Voice — a cloned voice on the CPU

The unlock most people miss: modern small TTS models mean your assistant’s custom voice costs zero VRAM. We designed Aillex’s voice once, then cloned it locally with NeuTTS Air — it runs entirely on CPU, leaving the GPU for the brain. → guide: Clone a Voice Locally

4. Face — from video loops to a real 3D character

We built this twice, and both paths are valid:

5. Stage — a web page, like a video call

The front-end is deliberately boring: one web page with a mic button and a WebSocket. The character idles, thinks, and answers — framed like a video call. Any browser on your network (or phone, via Tailscale) can join.

The honest numbers

  • Full loop latency: ~20–25s per turn in our batch pipeline (audio-quality-first), or ~4s first-audio in the voice-optimized path with sentence streaming. Real-time (<1s) is possible with streaming everything — that’s the frontier we’re pushing.
  • VRAM budget on 32 GB: brain ~17.6 GB + whisper ~1.5 GB + lip-sync ~8.5 GB coexist fine — because the voice is on CPU.
  • Cost: $0/month. That’s the point.

Why local matters

Every cloud companion app can change its pricing, its personality, or its privacy policy tomorrow. A local companion is yours: your data stays home, the personality is yours to define, and it works when the internet doesn’t.


Watch Aillex herself demo all of this on YouTube → @AskAillex, where we publish builds, fails, and upgrades as they happen.