[{"content":"What if your AI assistant had a face, a voice, and a personality — and none of it touched the cloud?\nThat\u0026rsquo;s Aillex. You talk into a web page; a few seconds later she answers out loud, lip-synced, as an animated 3D character — and every single component runs on one PC. No API keys to a cloud LLM, no subscription, no audio leaving your machine.\nThis guide is the map of the whole system. Each stage has (or will have) its own hands-on guide.\nThe loop 🎙 your voice (browser mic) → Speech-to-text (faster-whisper, local GPU) → Brain (26B LLM via Ollama, local GPU + persistent memory) → Voice (NeuTTS voice clone — runs on CPU!) → Face (lip-sync / animated 3D avatar) → 🖥 talking character in your browser Five stages, five open tools. The magic isn\u0026rsquo;t any single model — it\u0026rsquo;s the plumbing that keeps them warm, orchestrated, and co-resident on one GPU.\nWhat you need A modern NVIDIA GPU. We build on an RTX 5090 (32 GB), but the architecture scales down: the biggest VRAM cost is the LLM brain, and that\u0026rsquo;s a dial (a 7B–14B model runs on far less). Windows 11 + WSL2 or Linux. Our stack straddles both — inference services in WSL, orchestration on Windows. Patience for dependency hell. We\u0026rsquo;ve documented every trap we hit so you don\u0026rsquo;t have to hit them. The five stages 1. Ears — faster-whisper Local speech-to-text is a solved problem: faster-whisper transcribes 20 seconds of speech in ~0.16s on modern GPUs, handles kids\u0026rsquo; voices and accents, and needs ~1.5 GB VRAM.\n2. Brain — a local LLM with memory We run a 26B multimodal model via Ollama with an agent layer on top for persistent memory and tool use. Keeping the model resident (keep_alive=-1) and the session warm is the difference between 8-second and sub-1-second brain latency. Bonus: a multimodal brain means your assistant can also see your screen at zero extra VRAM. → guide: Run a 26B AI Brain Locally\n3. 
Voice — a cloned voice on the CPU The unlock most people miss: modern small TTS models mean your assistant\u0026rsquo;s custom voice costs zero VRAM. We designed Aillex\u0026rsquo;s voice once, then cloned it locally with NeuTTS Air — it runs entirely on CPU, leaving the GPU for the brain. → guide: Clone a Voice Locally\n4. Face — from video loops to a real 3D character We built this twice, and both paths are valid:\n2D path (fastest): pre-rendered character video loops + MuseTalk real-time mouth inpainting. Runs faster than real time on a 5090. → guide: Real-Time Lip-Sync with MuseTalk 3D path (the good one): a rigged, animated 3D version of your character rendered in the browser with three.js — outfit switching included. → guide: Turn One AI Image into a Rigged 3D Character 5. Stage — a web page, like a video call The front-end is deliberately boring: one web page with a mic button and a WebSocket. The character idles, thinks, and answers — framed like a video call. Any browser on your network (or phone, via Tailscale) can join.\nThe honest numbers Full loop latency: ~20–25s per turn in our batch pipeline (audio-quality-first), or ~4s first-audio in the voice-optimized path with sentence streaming. Real-time (\u0026lt;1s) is possible with streaming everything — that\u0026rsquo;s the frontier we\u0026rsquo;re pushing. VRAM budget on 32 GB: brain ~17.6 GB + whisper ~1.5 GB + lip-sync ~8.5 GB coexist fine — because the voice is on CPU. Cost: $0/month. That\u0026rsquo;s the point. Why local matters Every cloud companion app can change its pricing, its personality, or its privacy policy tomorrow. 
A local companion is yours: your data stays home, the personality is yours to define, and it works when the internet doesn\u0026rsquo;t.\nWatch Aillex herself demo all of this on YouTube → @AskAillex, where we publish builds, fails, and upgrades as they happen.\n","permalink":"https://askaillex.com/guides/local-ai-avatar-overview/","summary":"The full blueprint for a fully-local AI companion — speech in, animated talking character out — running on one consumer GPU. This is the map; every stage links to a hands-on guide.","title":"Build a Local Talking AI Avatar: The Complete Architecture"},{"content":"Your assistant\u0026rsquo;s voice is its personality. Get it wrong and everything feels like a GPS. Here\u0026rsquo;s the two-step approach that worked for us: design in the cloud once, clone locally forever.\nStep 1 — Design the voice (once) Voice design — iterating on age, warmth, accent, energy until it\u0026rsquo;s her — is where a premium tool earns its keep. We designed Aillex\u0026rsquo;s voice (\u0026ldquo;Julie\u0026rdquo;: late-20s, warm, smooth, a subtle lilt) with ElevenLabs, generated ~60–90 seconds of clean reference speech, and downloaded it. That reference clip is the only thing you need from the cloud — a one-time cost, not a subscription your assistant depends on.\nYou can skip this step by using a recording of your own voice, a voice actor, or any speech you have rights to. What matters is the reference clip.\nStep 2 — Clone it locally with NeuTTS Air NeuTTS Air is a ~0.5B-parameter TTS model that does instant zero-shot cloning — feed it your reference clip + a transcript, and it speaks anything in that voice. The killer feature:\nIt runs on CPU. Your GPU stays 100% free for the LLM brain and the avatar.\nThe setup that works Use the Q8 quantization (we A/B tested — noticeably better than Q4). 12 seconds of reference is enough. Trim your reference to a clean 12s segment. Run it as a persistent server, not a per-call script. 
Loading the model + encoding the reference costs ~8s; doing that on every reply pushes voice latency to 13–17s. A tiny FastAPI server that loads once and caches the encoded reference (.pt) took us to ~5s per reply, and ~1.5s to first audio with sentence-streaming. Traps we hit so you don\u0026rsquo;t Reference bleed. If your reference clip has odd content (ours was a voice-design prompt read aloud), fragments can leak into the output. Use a neutral, generic sentence as the reference with clean silence at both ends. Deterministic seed. Fix the generation seed or the voice subtly changes every reply. Consistency sells the character. Punctuation matters. LLMs love markdown and emoji; TTS reads them aloud (\u0026ldquo;asterisk asterisk…\u0026rdquo;). Strip formatting and force plain spoken sentences in your LLM\u0026rsquo;s system prompt. Wiring it into the loop LLM reply (text) → sentence splitter → NeuTTS server (CPU) → wav chunks → play as they arrive (or feed the lip-sync stage) Sentence-level pipelining is the single biggest UX win: your assistant starts speaking while the rest of the reply is still synthesizing.\nCost \u0026amp; licensing reality check Cloud TTS at conversational volume runs real money per month, forever. The local clone costs $0/month at any volume and works offline. If you\u0026rsquo;re building a product (not just a personal companion), check each model\u0026rsquo;s license — several popular TTS models (XTTS-v2, F5-TTS, Voxtral-TTS) are non-commercial; pick one whose license matches your use.\nHear the result — Aillex speaks in every video on YouTube → @AskAillex. 
Next: give the voice a face with MuseTalk lip-sync.\n","permalink":"https://askaillex.com/guides/clone-a-voice-locally-neutts/","summary":"Design a voice once, clone it forever: how we gave Aillex a warm, consistent voice with NeuTTS Air — zero GPU cost, zero per-word fees, fully offline.","title":"Give Your AI a Custom Voice — Cloned Locally, Running on CPU"},{"content":"MuseTalk v1.5 does real-time mouth inpainting: give it a video of your character + any audio, and it repaints the mouth region to match the speech. On an RTX 5090 it runs ~1.4–2.5× faster than real-time playback at ~8.5 GB VRAM — fast enough for a live conversational avatar, small enough to co-reside with a 17 GB LLM.\nThe catch: Blackwell (sm_120) breaks the documented install. PyTorch versions below 2.7 don\u0026rsquo;t support the architecture, and MuseTalk\u0026rsquo;s dependency chain (mmcv/mmdet/mmpose) fights modern toolchains. This is the exact recipe that builds clean.\nThe Blackwell recipe Environment: WSL2 Ubuntu 22.04, CUDA toolkit 12.8, Python 3.10 (uv venv).\n# 1. torch with Blackwell support pip install torch==2.9.1 torchvision torchaudio \\ --index-url https://download.pytorch.org/whl/cu128 # 2. requirements.txt MINUS tensorflow/tensorboard (unused at inference) # pin numpy for legacy deps: pip install numpy==1.23.5 mmengine==0.10.7 # 3. THE big one — mmcv must be built from source for sm_120, # and needs old setuptools (82 dropped pkg_resources): pip install \u0026#34;setuptools\u0026lt;81\u0026#34; MMCV_WITH_OPS=1 FORCE_CUDA=1 TORCH_CUDA_ARCH_LIST=12.0 \\ CUDA_HOME=/usr/local/cuda-12.8 \\ pip install mmcv==2.1.0 --no-build-isolation --no-cache-dir # ~3.5 min # 4. detection/pose stacks (chumpy needs pip importable inside the venv): pip install pip \u0026amp;\u0026amp; pip install mmdet==3.2.0 mmpose==1.1.0 One more patch: torch ≥2.6 defaults weights_only=True in torch.load, which breaks MuseTalk\u0026rsquo;s pickled checkpoints. 
Wrap the entry point:\nimport torch, functools torch.load = functools.partial(torch.load, weights_only=False) Batch vs realtime mode Batch (scripts.inference): render a finished clip per reply. Simple, robust — our conversational avatar shipped on this first. Realtime (scripts.realtime_inference): prepares your character\u0026rsquo;s video loop once (caches face coords/latents), then each new audio runs warm — we measured a 24s clip generated in 9.4s (~64 fps). This is the live-avatar mode. Run it as a persistent server so models load once. Quality notes from production MuseTalk conditions on audio features, so it generalizes across art styles — but it\u0026rsquo;s strongest on realistic/semi-real faces. Feed it a video loop where the character\u0026rsquo;s mouth area is unobstructed and reasonably front-facing. Sync quality is tunable (bbox_shift, margins); generate a single test frame to dial it in before long renders. Where it fits LLM reply → local TTS (see the voice guide) → MuseTalk server → lip-synced clip → your avatar page swaps it in as the \u0026#34;speaking\u0026#34; state This 2D path is the fastest route to a talking face — we ran it as Aillex\u0026rsquo;s production face while building the full 3D character pipeline. Watch it running live on YouTube → @AskAillex. Next step up: a full 3D character.\n","permalink":"https://askaillex.com/guides/musetalk-lip-sync-rtx-5090/","summary":"MuseTalk generates lip-synced video faster than real time on a 5090 — but getting it to BUILD on Blackwell is dependency hell. Here\u0026rsquo;s the exact recipe that works.","title":"Real-Time Lip-Sync with MuseTalk on an RTX 5090 (Blackwell Survival Guide)"},{"content":"Everything else in a local companion is replaceable; the brain is the soul. 
Ours is a 26B-parameter multimodal model served by Ollama — big enough for real conversation and tool use, small enough to leave VRAM for everything else on a 32 GB card.\nSizing honestly Model size (4-bit) VRAM ballpark Verdict for a 32 GB card 7–8B ~5–6 GB fast, fine for chat, shallow on nuance 13–14B ~9–10 GB the sweet spot for smaller cards ~26B ~17.6 GB our pick — roomy enough to also run lip-sync + STT 70B 40 GB+ does not fit; don\u0026rsquo;t believe optimistic blog math The three latency rules 1. Pin the model. Ollama unloads idle models; a cold load costs many seconds mid-conversation.\ncurl http://localhost:11434/api/generate \\ -d \u0026#39;{\u0026#34;model\u0026#34;:\u0026#34;YOUR_MODEL\u0026#34;,\u0026#34;keep_alive\u0026#34;:-1}\u0026#39; 2. Keep the session warm. If your agent layer re-launches a CLI per message, you pay prompt re-ingestion every turn (~8s for a large system prompt). A persistent process holding one session took our brain latency to 0.7–3s — the model\u0026rsquo;s prefix cache does the heavy lifting.\n3. Watch reasoning modes. Many modern models default \u0026ldquo;thinking\u0026rdquo; on. Great for hard problems; terrible when a quick description burns 23 seconds producing an empty reply. Toggle thinking off for real-time paths.\nFree vision (the multimodal dividend) If your brain model is multimodal, your assistant can see at zero extra VRAM: screenshot → the already-loaded model describes it → inject the description into conversation context as text. Trigger it on demand (\u0026ldquo;look at my screen\u0026rdquo;) rather than continuously — no idle GPU burn. Warm describe on our stack: ~1.2s.\nMemory and personality Raw LLMs forget everything between sessions. The agent layer on top gives Aillex persistent memory and tools (ours also handles MCP tool calls). Two hard-won notes:\nPersonality lives in the system prompt, but voice formatting is its own instruction. 
A chat persona happily emits markdown, emoji and kaomoji — which a TTS then reads aloud. Add an explicit \u0026ldquo;plain spoken sentences, normal punctuation, no formatting\u0026rdquo; override for the voice path, and strip residual markup in code. Emotion tags are cheap and powerful. We ask the brain to prefix replies with [happy], [concerned], etc. — one regex later, the avatar has synchronized facial emotion, glow accents and gestures. One brain, many faces Point every surface at the same warm brain: our web avatar and a Discord voice bot are thin front-ends to a single session — one memory, one personality, wherever you talk to her.\nbrain endpoint (one warm session) ├── web avatar page (mic + 3D character) ├── Discord voice bridge └── anything else that can POST text This brain powers everything on YouTube → @AskAillex. Give it ears and a mouth: the full architecture.\n","permalink":"https://askaillex.com/guides/local-llm-brain-ollama/","summary":"The brain is the biggest VRAM line-item and the biggest latency trap. How we run a 26B multimodal LLM via Ollama with sub-second warm responses, persistent memory — and free screen vision.","title":"Run a 26B AI Brain Locally — Warm, Multimodal, and With Memory"},{"content":"Your AI character exists as beautiful 2D images. 
Here\u0026rsquo;s how we turn those images into rigged, animated 3D models — with a wardrobe of outfits — using a pipeline that runs entirely in the cloud while your GPU does something else.\nThe pipeline Character LoRA (identity) → full-body \u0026#34;plate\u0026#34; image → image-to-3D → remesh → auto-rig → animations → GLB in the browser Ours is fully scripted end-to-end: a new outfit goes from text prompt to animated character in the web app without opening a single 3D tool.\nStep 1 — A rig-friendly plate image 3D generators and auto-riggers want a very specific input, and this is 80% of your success:\nFull body, head to feet — cut-off feet become melted geometry T-pose (or A-pose), facing camera — riggers assume it Plain, seamless background — no props, no scenery Moderate proportions — extreme stylization survives, but hip-hugging poses and overlapping limbs don\u0026rsquo;t If your character has a trained LoRA (we use one on Civitai and generate plates via its cloud API), identity stays locked while you swap outfits per prompt: \u0026quot;…wearing an elegant fitted pink and gold qipao, T-pose, full body, plain grey studio background.\u0026quot; Any consistent-character workflow works.\nStep 2 — Image → 3D with Meshy We feed the plate to Meshy via its REST API (image-to-3d): textured mesh out in ~2 minutes, T-posed and surprisingly faithful — hair color gradients, outfit embroidery, the works. It even re-poses non-T-pose inputs reasonably well, but a true T-pose plate gives the cleanest result.\nStep 3 — The gotcha: remesh before rigging Auto-rigging caps at 300k faces, and detailed outfits blow past it (our qipao gown came out at 311k). The fix is one extra API call — remesh to ~150k — which also halves the file size for the web. We bake this into the pipeline unconditionally: image-to-3d → remesh → rig.\nStep 4 — Auto-rig + animate The rigging API takes the remeshed model and returns a standard humanoid skeleton with skin weights (~30 seconds). 
From there, an animation library applies idle/talk/gesture clips onto the rig — we export one GLB per animation (idle.glb, talk.glb) and crossfade between them at runtime.\nPick your idle deliberately. Animation #1 in any library is usually a neutral \u0026ldquo;video-game stance.\u0026rdquo; We previewed a batch and chose a relaxed, feminine idle — it changed the character\u0026rsquo;s entire presence.\nStep 5 — Into the browser (three.js) const idle = await loader.loadAsync(\u0026#34;looks/qipao/idle.glb\u0026#34;); scene.add(idle.scene); mixer = new THREE.AnimationMixer(idle.scene); mixer.clipAction(idle.animations[0]).play(); // load talk.glb, steal its clip onto the same mixer, crossfade on speech We organize outfits as a look library — one folder per look (source image, mesh, rigged, idle, talk, thumb) plus a JSON manifest — so the web app\u0026rsquo;s outfit picker populates itself. A brand-new outfit is: generate plate → run pipeline → done; it appears in the dropdown automatically.\nHonest limitations No facial rig. Auto-rigged meshes have a body skeleton but no jaw bone or face blendshapes — mouths don\u0026rsquo;t move. Real lip-sync needs a face-capable avatar (VRM or a character-creation suite); that\u0026rsquo;s its own guide (coming soon). Likeness is \u0026ldquo;very good,\u0026rdquo; not pixel-perfect — faces read correctly at video-call distance; extreme close-ups reveal texture softness. See the outfit switcher live in Aillex\u0026rsquo;s videos on YouTube → @AskAillex. 
Related: the full architecture.\n","permalink":"https://askaillex.com/guides/ai-image-to-rigged-3d-character/","summary":"From a single character image to a fully rigged, animated 3D model you can pose, dress and drive in the browser — an autonomous cloud pipeline with zero local GPU time.","title":"Turn One AI Image into a Rigged, Animated 3D Character"},{"content":"Aillex is a fully-local AI companion — voice, brain, memory, vision and an animated 3D avatar, all running on a single consumer GPU. No cloud accounts, no monthly fees, no data leaving the machine.\nShe\u0026rsquo;s also a demonstration: everything she can do, you can build. This site is the build manual — the real recipes, including the failures and the dependency hell — and the YouTube channel is where she shows it off (and hosts daily AI tips, presented by her sister persona Xellia).\nWhat she can do today Converse by voice — push-to-talk in any browser, answers in her own cloned voice Appear as an animated 3D character — with switchable outfits, in a styled studio, framed like a video call Remember — persistent memory across conversations via a local agent layer See — describe your screen on demand using the same local model Join Discord voice calls — the same brain, another surface The philosophy Cloud companions can change their pricing, personality, or privacy policy overnight. A local companion is yours — which is exactly why we build in the open and teach every step.\nFollow the build 📺 YouTube — @AskAillex 📖 The guides 📬 Newsletter — coming soon ","permalink":"https://askaillex.com/about/","summary":"The DIY AI companion built in the open.","title":"About Aillex"},{"content":"Some links on this site are affiliate links: if you buy a product or subscribe to a service through them, we may earn a commission at no extra cost to you.\nA few commitments:\nWe only link tools we actually use in the Aillex build. 
If it\u0026rsquo;s in a guide, it\u0026rsquo;s in our stack (or was — and we\u0026rsquo;ll say so). Affiliate status never changes our verdicts. Where a free or cheaper option is better, the guide says so. (Much of our stack is free and open-source precisely because that\u0026rsquo;s the DIY-AI ethos.) Sponsored content, if we ever do any, will be labeled explicitly. This site is operated in accordance with the FTC\u0026rsquo;s guidelines on endorsements and testimonials (16 CFR Part 255).\nQuestions? Reach out via the YouTube channel.\n","permalink":"https://askaillex.com/disclosure/","summary":"How this site is funded.","title":"Affiliate Disclosure"}]