Local LTX vs Cloud Wan 2.6: We Generated the Same Character Both Ways

We generate AI video of the same character every day, two completely different ways: LTX-2.3 locally on our own GPU, and Wan 2.6 in the cloud via Civitai’s generation API. Same source images, same character, production volume on both. Here’s the honest comparison nobody can give you from a benchmark table.

The head-to-head

	LTX-2.3 (local)	Wan 2.6 (cloud)
Hardware	RTX 5090, ~30 GB VRAM	none (runs on Civitai/fal)
Max clip length	24 s single pass (longer possible)	5 / 10 / 15 s fixed
Speed (our real numbers)	~50–60 min for 24 s @ 704×1216	~8 min for 15 s @ 720p
Marginal cost	electricity	~1,000 Buzz per clip (~$1)
Motion quality	good, occasionally stiff	noticeably smoother, more natural
Identity fidelity	excellent (it animates your image)	excellent (same reason)
Length flexibility	king — long coherent single passes	capped, and the cap is real
Privacy	total	your image goes to a cloud
Fails	dependency hell, VRAM management	opaque rejections, moderation filters
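
To put the speed row on one scale, the sketch below converts each figure into seconds of rendering per second of finished footage. It uses only the numbers in the table. The function name `render_ratio` is made up for illustration, and the two engines render at different resolutions (704×1216 vs 720p), so treat the ratio as a rough guide rather than a benchmark.

```python
# Render-time ratio: seconds of compute per second of finished footage.
# Inputs are the "our real numbers" figures from the table above.

def render_ratio(render_minutes: float, clip_seconds: float) -> float:
    """Return how many seconds of rendering one second of footage costs."""
    return render_minutes * 60 / clip_seconds


ltx_low = render_ratio(50, 24)   # LTX-2.3, fast end of the 50-60 min range
ltx_high = render_ratio(60, 24)  # LTX-2.3, slow end of the range
wan = render_ratio(8, 15)        # Wan 2.6, 15 s clip in about 8 min

print(f"LTX-2.3 local: {ltx_low:.0f}-{ltx_high:.0f}x real time")
print(f"Wan 2.6 cloud: {wan:.0f}x real time")
```

On these figures the cloud path is roughly four to five times faster per second of footage, while the local path is the only one that delivers 24 s in a single pass.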

What we learned in production

Wan’s motion is genuinely better — in bursts. Natural weight shifts, believable hand movement, expressive reactions. For short emotive clips (a reaction, a wave, typing at a keyboard) it beats our local renders visibly.

LTX’s length is the moat. A 24-second coherent single-pass render — no seams, no identity drift — is something the cloud tier simply won’t sell you. Long presenter takes, the backbone of our daily videos, are local-only territory.

The costs converge in a surprising place. Cloud looks cheap per clip until you’re iterating: four rejected takes at ~$1 each versus a local re-render that costs you nothing but time. Local looks free until you price the hour of GPU time. Our rule: iterate local, burst cloud.
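
The trade-off above can be sketched as simple arithmetic. The ~$1 per cloud take and the roughly one hour per local take come from the table. The GPU power draw, electricity rate, and hourly value of GPU time are placeholder assumptions, not measured figures, so substitute your own.

```python
def cloud_cost_usd(takes: int, usd_per_take: float = 1.0) -> float:
    """Every cloud take is billed, including the rejected ones."""
    return takes * usd_per_take


def local_cost_usd(
    takes: int,
    hours_per_take: float = 1.0,      # ~50-60 min per 24 s take, from the table
    gpu_kw: float = 0.6,              # ASSUMPTION: average draw under load
    usd_per_kwh: float = 0.30,        # ASSUMPTION: electricity price
    gpu_hour_value_usd: float = 0.0,  # ASSUMPTION: what an hour of the GPU is worth to you
) -> float:
    """Electricity plus whatever value you put on the occupied GPU."""
    hours = takes * hours_per_take
    return hours * (gpu_kw * usd_per_kwh + gpu_hour_value_usd)


takes = 5  # four rejected takes plus the keeper
print(f"cloud:                   ${cloud_cost_usd(takes):.2f}")
print(f"local, electricity only: ${local_cost_usd(takes):.2f}")
print(f"local, GPU hour at $1:   ${local_cost_usd(takes, gpu_hour_value_usd=1.0):.2f}")
```

With these placeholder numbers, five local takes cost under a dollar of electricity but about five hours of wall-clock time. Once an hour of GPU time is valued at around a dollar, the two paths land in the same range.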

Cloud failure modes are opaque. We’ve had generations fail with no error message, and innocuous prompts have tripped content flags. Budget retry time into any cloud workflow. Local failures at least come with a stack trace.
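
One way to budget for retries is a small wrapper with exponential backoff. This is a generic sketch, not any provider's API: `submit` is a hypothetical callable that you would write around your own cloud client, and it is assumed to return a result string on success and `None` (or raise) on a silent failure. The attempt count and delays are placeholder values.

```python
import time
from typing import Callable, Optional


def generate_with_retries(
    submit: Callable[[], Optional[str]],  # HYPOTHETICAL: wraps your own cloud call
    max_attempts: int = 4,                # ASSUMPTION: tune to your budget
    base_delay_s: float = 30.0,           # ASSUMPTION: first wait between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call submit() until it returns a result or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = submit()
        except Exception as exc:
            # Treat an exception like any other failed attempt, but log it.
            print(f"attempt {attempt} raised: {exc!r}")
            result = None
        if result:
            return result
        if attempt < max_attempts:
            # Wait 30 s, 60 s, 120 s, ... before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))
    return None  # Caller decides what to do, e.g. fall back to a local render.
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting. Note that a retry will not help if a prompt is rejected by moderation every time, so a capped attempt count matters more than the exact delays.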

What we actually do (the hybrid)

Long presenter takes / talking bases: LTX local, 24 s passes
Short expressive clips (reactions, typing, gestures): Wan 2.6 cloud
Stitching: quick masked transitions (flash/glitch) at the joins — the mixed footage cuts together cleanly because both animate the same source images of the same character
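
The split above can be written down as a small routing rule. This is an illustrative sketch only: the clip kinds, engine labels, and the `plan_clip` function are hypothetical names, and splitting long takes into 24 s passes is an assumption based on the pass length quoted above. The 5 / 10 / 15 s durations are the fixed cloud lengths from the comparison table.

```python
import math

WAN_DURATIONS_S = (5, 10, 15)  # fixed cloud clip lengths from the table
LTX_PASS_S = 24                # single-pass length used for local renders


def plan_clip(kind: str, seconds: float) -> dict:
    """Pick an engine and pass lengths for one requested clip.

    kind is a hypothetical label: "expressive" for reactions, typing,
    and gestures; anything else is treated as a long presenter take.
    """
    if kind == "expressive" and seconds <= max(WAN_DURATIONS_S):
        # Round up to the smallest fixed duration that covers the request.
        duration = min(d for d in WAN_DURATIONS_S if d >= seconds)
        return {"engine": "wan-2.6-cloud", "passes": [duration]}
    # Everything else goes local, in as many 24 s passes as needed.
    passes = math.ceil(seconds / LTX_PASS_S)
    return {"engine": "ltx-2.3-local", "passes": [LTX_PASS_S] * passes}


print(plan_clip("expressive", 7))
print(plan_clip("presenter", 60))
```

A 7 s reaction is routed to the cloud as a 10 s clip to be trimmed, and a 60 s presenter take becomes three local passes. An expressive clip longer than 15 s falls through to the local path, since the cloud cap cannot cover it.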

One character, two engines, each doing what it’s best at. That’s the actual answer to “which is better” — and it’s why our streamer-cam footage mixes both in a single 90-second sequence.

The character pipeline behind both: one image → rigged 3D · the hardware for the local path: our build
