Run a 26B AI Brain Locally — Warm, Multimodal, and With Memory

Everything else in a local companion is replaceable; the brain is the soul. Ours is a 26B-parameter multimodal model served by Ollama — big enough for real conversation and tool use, small enough to leave VRAM for everything else on a 32 GB card.

Sizing honestly

Model size (4-bit)	VRAM ballpark	Verdict for a 32 GB card
7–8B	~5–6 GB	fast, fine for chat, shallow on nuance
13–14B	~9–10 GB	the sweet spot for smaller cards
~26B	~17.6 GB	our pick — roomy enough to also run lip-sync + STT
70B	40 GB+	does not fit; don’t believe optimistic blog math

The three latency rules

1. Pin the model. Ollama unloads idle models; a cold load costs many seconds mid-conversation.

curl http://localhost:11434/api/generate \
  -d '{"model":"YOUR_MODEL","keep_alive":-1}'

2. Keep the session warm. If your agent layer re-launches a CLI per message, you pay prompt re-ingestion every turn (~8s for a large system prompt). A persistent process holding one session took our brain latency to 0.7–3s — the model’s prefix cache does the heavy lifting.

3. Watch reasoning modes. Many modern models default “thinking” on. Great for hard problems; terrible when a quick description burns 23 seconds producing an empty reply. Toggle thinking off for real-time paths.

Free vision (the multimodal dividend)

If your brain model is multimodal, your assistant can see at zero extra VRAM: screenshot → the already-loaded model describes it → inject the description into conversation context as text. Trigger it on demand (“look at my screen”) rather than continuously — no idle GPU burn. Warm describe on our stack: ~1.2s.

Memory and personality

Raw LLMs forget everything between sessions. The agent layer on top gives Aillex persistent memory and tools (ours also handles MCP tool calls). Two hard-won notes:

Personality lives in the system prompt, but voice formatting is its own instruction. A chat persona happily emits markdown, emoji and kaomoji — which a TTS then reads aloud. Add an explicit “plain spoken sentences, normal punctuation, no formatting” override for the voice path, and strip residual markup in code.
Emotion tags are cheap and powerful. We ask the brain to prefix replies with [happy], [concerned], etc. — one regex later, the avatar has synchronized facial emotion, glow accents and gestures.

One brain, many faces

Point every surface at the same warm brain: our web avatar and a Discord voice bot are thin front-ends to a single session — one memory, one personality, wherever you talk to her.

brain endpoint (one warm session)
   ├── web avatar page (mic + 3D character)
   ├── Discord voice bridge
   └── anything else that can POST text

This brain powers everything on YouTube → @AskAillex. Give it ears and a mouth: the full architecture.

Sizing honestly#

The three latency rules#

Free vision (the multimodal dividend)#

Memory and personality#

One brain, many faces#

Sizing honestly

The three latency rules

Free vision (the multimodal dividend)

Memory and personality

One brain, many faces