Everything else in a local companion is replaceable; the brain is the soul. Ours is a 26B-parameter multimodal model served by Ollama — big enough for real conversation and tool use, small enough to leave VRAM for everything else on a 32 GB card.

Sizing honestly

Model size (4-bit)VRAM ballparkVerdict for a 32 GB card
7–8B~5–6 GBfast, fine for chat, shallow on nuance
13–14B~9–10 GBthe sweet spot for smaller cards
~26B~17.6 GBour pick — roomy enough to also run lip-sync + STT
70B40 GB+does not fit; don’t believe optimistic blog math

The three latency rules

1. Pin the model. Ollama unloads idle models; a cold load costs many seconds mid-conversation.

curl http://localhost:11434/api/generate \
  -d '{"model":"YOUR_MODEL","keep_alive":-1}'

2. Keep the session warm. If your agent layer re-launches a CLI per message, you pay prompt re-ingestion every turn (~8s for a large system prompt). A persistent process holding one session took our brain latency to 0.7–3s — the model’s prefix cache does the heavy lifting.

3. Watch reasoning modes. Many modern models default “thinking” on. Great for hard problems; terrible when a quick description burns 23 seconds producing an empty reply. Toggle thinking off for real-time paths.

Free vision (the multimodal dividend)

If your brain model is multimodal, your assistant can see at zero extra VRAM: screenshot → the already-loaded model describes it → inject the description into conversation context as text. Trigger it on demand (“look at my screen”) rather than continuously — no idle GPU burn. Warm describe on our stack: ~1.2s.

Memory and personality

Raw LLMs forget everything between sessions. The agent layer on top gives Aillex persistent memory and tools (ours also handles MCP tool calls). Two hard-won notes:

  • Personality lives in the system prompt, but voice formatting is its own instruction. A chat persona happily emits markdown, emoji and kaomoji — which a TTS then reads aloud. Add an explicit “plain spoken sentences, normal punctuation, no formatting” override for the voice path, and strip residual markup in code.
  • Emotion tags are cheap and powerful. We ask the brain to prefix replies with [happy], [concerned], etc. — one regex later, the avatar has synchronized facial emotion, glow accents and gestures.

One brain, many faces

Point every surface at the same warm brain: our web avatar and a Discord voice bot are thin front-ends to a single session — one memory, one personality, wherever you talk to her.

brain endpoint (one warm session)
   ├── web avatar page (mic + 3D character)
   ├── Discord voice bridge
   └── anything else that can POST text

This brain powers everything on YouTube → @AskAillex. Give it ears and a mouth: the full architecture.