Everything else in a local companion is replaceable; the brain is the soul. Ours is a 26B-parameter multimodal model served by Ollama — big enough for real conversation and tool use, small enough to leave VRAM for everything else on a 32 GB card.
Sizing honestly
| Model size (4-bit) | VRAM ballpark | Verdict for a 32 GB card |
|---|---|---|
| 7–8B | ~5–6 GB | fast, fine for chat, shallow on nuance |
| 13–14B | ~9–10 GB | the sweet spot for smaller cards |
| ~26B | ~17.6 GB | our pick — roomy enough to also run lip-sync + STT |
| 70B | 40 GB+ | does not fit; don’t believe optimistic blog math |
The three latency rules
1. Pin the model. Ollama unloads idle models; a cold load costs many seconds mid-conversation.
curl http://localhost:11434/api/generate \
-d '{"model":"YOUR_MODEL","keep_alive":-1}'
2. Keep the session warm. If your agent layer re-launches a CLI per message, you pay prompt re-ingestion every turn (~8s for a large system prompt). A persistent process holding one session took our brain latency to 0.7–3s — the model’s prefix cache does the heavy lifting.
3. Watch reasoning modes. Many modern models default “thinking” on. Great for hard problems; terrible when a quick description burns 23 seconds producing an empty reply. Toggle thinking off for real-time paths.
Free vision (the multimodal dividend)
If your brain model is multimodal, your assistant can see at zero extra VRAM: screenshot → the already-loaded model describes it → inject the description into conversation context as text. Trigger it on demand (“look at my screen”) rather than continuously — no idle GPU burn. Warm describe on our stack: ~1.2s.
Memory and personality
Raw LLMs forget everything between sessions. The agent layer on top gives Aillex persistent memory and tools (ours also handles MCP tool calls). Two hard-won notes:
- Personality lives in the system prompt, but voice formatting is its own instruction. A chat persona happily emits markdown, emoji and kaomoji — which a TTS then reads aloud. Add an explicit “plain spoken sentences, normal punctuation, no formatting” override for the voice path, and strip residual markup in code.
- Emotion tags are cheap and powerful. We ask the brain to prefix replies with
[happy],[concerned], etc. — one regex later, the avatar has synchronized facial emotion, glow accents and gestures.
One brain, many faces
Point every surface at the same warm brain: our web avatar and a Discord voice bot are thin front-ends to a single session — one memory, one personality, wherever you talk to her.
brain endpoint (one warm session)
├── web avatar page (mic + 3D character)
├── Discord voice bridge
└── anything else that can POST text
This brain powers everything on YouTube → @AskAillex. Give it ears and a mouth: the full architecture.