MuseTalk v1.5 does real-time mouth inpainting: give it a video of your character plus any audio, and it repaints the mouth region to match the speech. On an RTX 5090 it generates ~1.4–2.5× faster than real-time playback at ~8.5 GB VRAM: fast enough for a live conversational avatar, and small enough to share the card’s 32 GB with a 17 GB LLM.
The catch: Blackwell (sm_120) breaks the documented install. PyTorch releases before 2.7 ship no sm_120 kernels, and MuseTalk’s dependency chain (mmcv/mmdet/mmpose) fights modern toolchains. This is the exact recipe that built cleanly for us.
The Blackwell recipe
Environment: WSL2 Ubuntu 22.04, CUDA toolkit 12.8, Python 3.10 (uv venv).
# 1. torch with Blackwell support
pip install torch==2.9.1 torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu128
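# 2. requirements.txt MINUS tensorflow/tensorboard (unused at inference);
#    requirements-inference.txt is just a scratch file name:
grep -viE '^(tensorflow|tensorboard)' requirements.txt > requirements-inference.txt
pip install -r requirements-inference.txt
# pin numpy for legacy deps:
pip install numpy==1.23.5 mmengine==0.10.7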
# 3. THE big one — mmcv must be built from source for sm_120,
# and needs old setuptools (82 dropped pkg_resources):
pip install "setuptools<81"
MMCV_WITH_OPS=1 FORCE_CUDA=1 TORCH_CUDA_ARCH_LIST=12.0 \
CUDA_HOME=/usr/local/cuda-12.8 \
pip install mmcv==2.1.0 --no-build-isolation --no-cache-dir # ~3.5 min
# 4. detection/pose stacks (chumpy needs pip importable inside the venv):
pip install pip && pip install mmdet==3.2.0 mmpose==1.1.0
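Before moving on, it can be worth confirming that the installed wheel actually carries sm_120 kernels. The following is a minimal sketch, not part of the MuseTalk repo: `supports_capability` and `check_installed_torch` are helper names invented here, while `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` are standard PyTorch calls.

```python
def supports_capability(arch_list, capability):
    """True if arch_list has native kernels (sm_XY) for a (major, minor) capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_installed_torch():
    """Describe whether the installed torch build matches the first visible GPU."""
    try:
        import torch  # imported lazily so the helper above works without torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    capability = torch.cuda.get_device_capability(0)  # (12, 0) for sm_120
    ok = supports_capability(torch.cuda.get_arch_list(), capability)
    return f"capability {capability}: {'supported' if ok else 'NOT in this build'}"


if __name__ == "__main__":
    print(check_installed_torch())
```

If the report says the capability is not in the build, the cu128 index URL from step 1 is the first thing to recheck.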
One more patch: torch ≥2.6 defaults weights_only=True in torch.load, which breaks MuseTalk’s pickled checkpoints. Wrap the entry point:
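import functools
import torch
# Restore the pre-2.6 default. This unpickles arbitrary objects, so load only checkpoints you trust.
torch.load = functools.partial(torch.load, weights_only=False)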
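If you would rather not change `torch.load` for the whole process, one option is to scope the override to the code that loads the checkpoints. This is a sketch under that assumption; `legacy_load` is a name made up for illustration, and it takes the module as an argument so it can be exercised without torch.

```python
import contextlib
import functools


@contextlib.contextmanager
def legacy_load(torch_module):
    """Temporarily make torch_module.load default to weights_only=False."""
    original = torch_module.load
    torch_module.load = functools.partial(original, weights_only=False)
    try:
        yield
    finally:
        torch_module.load = original  # restore the stricter default


# Intended usage (assumes torch is installed):
#   import torch
#   with legacy_load(torch):
#       ...  # build the MuseTalk models here
```

The trade-off is that every checkpoint load has to happen inside the `with` block; anything loaded lazily afterwards will hit the strict default again.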
Batch vs realtime mode
- Batch (scripts.inference): render a finished clip per reply. Simple and robust; our conversational avatar shipped on this first.
- Realtime (scripts.realtime_inference): prepares your character’s video loop once (caches face coords/latents), then each new audio runs warm. We measured a 24s clip generated in 9.4s (~64 fps). This is the live-avatar mode. Run it as a persistent server so models load once.
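The throughput figure can be reproduced with a few lines of arithmetic. The sketch below assumes the output video runs at 25 fps; that frame rate is not stated above, but it is the value that makes a 24s clip in 9.4s come out at roughly 64 fps. `realtime_stats` is an illustrative helper, not a MuseTalk function.

```python
def realtime_stats(clip_seconds, render_seconds, video_fps=25.0):
    """Return (frames rendered per second, speed relative to real-time playback)."""
    frames = clip_seconds * video_fps  # 24 s * 25 fps = 600 frames
    return frames / render_seconds, clip_seconds / render_seconds


fps, speedup = realtime_stats(24.0, 9.4)
print(f"{fps:.1f} fps, {speedup:.2f}x real time")  # 63.8 fps, 2.55x real time
```

Anything above 1.0x leaves headroom for TTS and network latency in a live loop.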
Quality notes from production
- MuseTalk conditions on audio features, so it generalizes across art styles — but it’s strongest on realistic/semi-real faces.
- Feed it a video loop where the character’s mouth area is unobstructed and reasonably front-facing.
- Sync quality is tunable (bbox_shift, margins); generate a single test frame to dial it in before long renders.
Where it fits
LLM reply → local TTS (see the voice guide) → MuseTalk server
→ lip-synced clip → your avatar page swaps it in as the "speaking" state
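The flow above can be sketched as a small orchestrator. Everything named here is hypothetical: `AvatarPipeline`, `synthesize_speech`, and `render_lipsync` stand in for whatever TTS and MuseTalk server interfaces you actually run, and none of them are MuseTalk APIs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AvatarPipeline:
    # Hypothetical hooks: plug in your own TTS and your MuseTalk server client.
    synthesize_speech: Callable[[str], bytes]  # reply text -> audio bytes
    render_lipsync: Callable[[bytes], str]     # audio bytes -> path/URL of the clip

    def speak(self, reply: str) -> dict:
        """Turn one LLM reply into the payload the avatar page swaps in."""
        audio = self.synthesize_speech(reply)
        clip = self.render_lipsync(audio)
        return {"state": "speaking", "clip": clip}


# Stub wiring, just to show the shape of the data moving through:
pipeline = AvatarPipeline(
    synthesize_speech=lambda text: text.encode("utf-8"),
    render_lipsync=lambda audio: f"clips/{len(audio)}.mp4",
)
print(pipeline.speak("hello"))  # {'state': 'speaking', 'clip': 'clips/5.mp4'}
```

Keeping the two stages behind plain callables makes it easy to swap batch mode for the realtime server without touching the page-side code.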
This 2D path is the fastest route to a talking face — we ran it as Aillex’s production face while building the full 3D character pipeline.
Watch it running live on YouTube → @AskAillex. Next step up: a full 3D character.