Real-Time Lip-Sync with MuseTalk on an RTX 5090 (Blackwell Survival Guide)

MuseTalk v1.5 does real-time mouth inpainting: give it a video of your character plus any audio, and it repaints the mouth region to match the speech. On an RTX 5090 it renders ~1.4–2.5× faster than real-time playback at ~8.5 GB VRAM: fast enough for a live conversational avatar, and small enough to share the card with a 17 GB LLM.

The catch: Blackwell (sm_120) breaks the documented install. Stable PyTorch releases before 2.7 ship no sm_120 kernels, and MuseTalk’s dependency chain (mmcv/mmdet/mmpose) fights modern toolchains. This is the exact recipe that built cleanly for us.

The Blackwell recipe

Environment: WSL2 Ubuntu 22.04, CUDA toolkit 12.8, Python 3.10 (uv venv).

# 1. torch with Blackwell support
pip install torch==2.9.1 torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cu128

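# 2. requirements.txt MINUS tensorflow/tensorboard (unused at inference).
#    One way to filter them out (run from the MuseTalk repo root; the output filename is arbitrary):
grep -v -E '^(tensorflow|tensorboard)' requirements.txt > requirements-infer.txt
pip install -r requirements-infer.txt
#    then pin numpy for legacy deps:
pip install numpy==1.23.5 mmengine==0.10.7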

# 3. THE big one — mmcv must be built from source for sm_120,
#    and needs old setuptools (82 dropped pkg_resources):
pip install "setuptools<81"
MMCV_WITH_OPS=1 FORCE_CUDA=1 TORCH_CUDA_ARCH_LIST=12.0 \
CUDA_HOME=/usr/local/cuda-12.8 \
  pip install mmcv==2.1.0 --no-build-isolation --no-cache-dir   # ~3.5 min

# 4. detection/pose stacks (chumpy needs pip importable inside the venv):
pip install pip && pip install mmdet==3.2.0 mmpose==1.1.0
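
Before moving on, it is worth confirming that the build actually targets the card. The following is a minimal sketch, assuming the packages above were installed into the active venv. `check_stack` is a hypothetical helper name, not part of MuseTalk or mmcv.

```python
import importlib


def check_stack():
    """Collect version and GPU info; missing packages are recorded, not raised."""
    report = {}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        report["torch"] = None
        return report

    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # An RTX 5090 should report compute capability (12, 0).
        report["capability"] = torch.cuda.get_device_capability(0)
        # Architectures this torch build ships kernels for; look for "sm_120".
        report["arch_list"] = torch.cuda.get_arch_list()

    try:
        # mmcv.ops should only import when the compiled extension was built
        # (the MMCV_WITH_OPS=1 source build in step 3).
        importlib.import_module("mmcv.ops")
        report["mmcv_ops"] = True
    except Exception as exc:  # ImportError, or a loader error from a bad build
        report["mmcv_ops"] = f"failed: {exc}"
    return report


if __name__ == "__main__":
    for key, value in check_stack().items():
        print(f"{key}: {value}")
```

If `arch_list` lacks `sm_120` or `mmcv_ops` reports a failure, the likely culprits are the torch wheel index in step 1 or the arch list passed to the mmcv build in step 3.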

One more patch: torch ≥2.6 defaults weights_only=True in torch.load, which breaks MuseTalk’s pickled checkpoints. Wrap the entry point:

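import functools
import torch
# Restore the pre-2.6 default for every later torch.load call; only load checkpoints you trust.
torch.load = functools.partial(torch.load, weights_only=False)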
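
The global patch changes `torch.load` for every library in the process. If that is broader than you want, one alternative is to scope the override to the checkpoint-loading code. This is a minimal sketch and not MuseTalk's own code: `default_kwargs` is a hypothetical helper, and the demo uses a stand-in namespace instead of torch so that it runs anywhere.

```python
import contextlib
import functools
import types


@contextlib.contextmanager
def default_kwargs(owner, name, **defaults):
    """Temporarily give owner.<name> new keyword defaults, then restore it."""
    original = getattr(owner, name)
    setattr(owner, name, functools.partial(original, **defaults))
    try:
        yield
    finally:
        setattr(owner, name, original)


# Stand-in for the torch module so the example is self-contained.
fake_torch = types.SimpleNamespace(
    load=lambda path, weights_only=True: (path, weights_only)
)

with default_kwargs(fake_torch, "load", weights_only=False):
    # Default is overridden inside the block.
    inside = fake_torch.load("model.pth")
    # A caller that passes the keyword explicitly still wins.
    explicit = fake_torch.load("model.pth", weights_only=True)

# The original function is back once the block exits.
outside = fake_torch.load("model.pth")
```

With real torch, the same helper would wrap the model-loading call as `with default_kwargs(torch, "load", weights_only=False): ...`. Either way, `weights_only=False` unpickles arbitrary objects, so it should only be used on checkpoints from a source you trust.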

Batch vs realtime mode

Batch (scripts.inference): renders a finished clip per reply. Simple and robust; our conversational avatar shipped on this first.
Realtime (scripts.realtime_inference): prepares your character’s video loop once (caching face coordinates and latents), then runs warm on each new audio clip. We measured a 24 s clip generated in 9.4 s, about 2.5× real time (~64 fps, which corresponds to 600 frames if the output is 25 fps). This is the live-avatar mode. Run it as a persistent server so the models load once.
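
The "load once, serve many" idea can be sketched as a long-lived worker that builds its pipeline on first use and reuses it for every later request. This is a shape sketch only: `load_pipeline`, `FakePipeline`, and `render` are hypothetical placeholders standing in for MuseTalk's model loading and inference, not its actual API.

```python
import functools
import queue
import threading
import time


class FakePipeline:
    """Placeholder for the loaded models plus the cached avatar data."""

    def render(self, audio_path):
        # Hypothetical: real code would run inference and write a video file.
        return audio_path.rsplit(".", 1)[0] + ".mp4"


@functools.lru_cache(maxsize=1)
def load_pipeline():
    """Expensive setup; lru_cache makes every later call return the same object."""
    return FakePipeline()


def worker(jobs, results):
    """Pull audio paths off a queue until a None sentinel arrives."""
    while True:
        audio_path = jobs.get()
        if audio_path is None:
            break
        start = time.perf_counter()
        clip = load_pipeline().render(audio_path)
        results.put((clip, time.perf_counter() - start))


def realtime_factor(clip_seconds, render_seconds):
    """Values above 1.0 mean the clip renders faster than it plays."""
    return clip_seconds / render_seconds


jobs, results = queue.Queue(), queue.Queue()
thread = threading.Thread(target=worker, args=(jobs, results))
thread.start()
for path in ("reply_001.wav", "reply_002.wav"):
    jobs.put(path)
jobs.put(None)  # tell the worker to stop
thread.join()

clips = [results.get()[0] for _ in range(2)]
```

Plugging in the measurement above, `realtime_factor(24, 9.4)` comes out at roughly 2.55. In a real deployment the queue would be fed by whatever transport you prefer (HTTP, a socket, a watched folder); the point is only that the process, and therefore the loaded models, outlives any single request.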

Quality notes from production

MuseTalk conditions on audio features, so it generalizes across art styles — but it’s strongest on realistic/semi-real faces.
Feed it a video loop where the character’s mouth area is unobstructed and reasonably front-facing.
Sync quality is tunable (bbox_shift, margins); generate a single test frame to dial it in before long renders.

Where it fits

LLM reply → local TTS (see the voice guide) → MuseTalk server
   → lip-synced clip → your avatar page swaps it in as the "speaking" state
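
As a sketch of how the stages chain together, the flow can be written as one function with each stage injected as a callable. All names here (`speak_reply`, `SpokenReply`, `llm`, `tts`, `lipsync`) are hypothetical; none of them is a MuseTalk or TTS API, and the stubs only show what is handed from stage to stage.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpokenReply:
    text: str        # what the LLM said
    audio_path: str  # TTS output fed to the lip-sync stage
    clip_path: str   # video the avatar page plays in its "speaking" state


def speak_reply(
    prompt: str,
    llm: Callable[[str], str],
    tts: Callable[[str], str],
    lipsync: Callable[[str], str],
) -> SpokenReply:
    """Run the three stages in order and keep every intermediate artifact."""
    text = llm(prompt)
    audio_path = tts(text)
    clip_path = lipsync(audio_path)
    return SpokenReply(text, audio_path, clip_path)


# Stub stages so the sketch runs on its own; swap in real clients here.
reply = speak_reply(
    "hello",
    llm=lambda prompt: f"echo: {prompt}",
    tts=lambda text: "reply.wav",
    lipsync=lambda audio_path: audio_path.replace(".wav", ".mp4"),
)
```

Keeping the stages as plain callables makes it easy to start with the batch renderer and later swap in the realtime server without touching the rest of the flow.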

This 2D path is the fastest route to a talking face — we ran it as Aillex’s production face while building the full 3D character pipeline.

Watch it running live on YouTube → @AskAillex. Next step up: a full 3D character.
