We generate AI video of the same character every day, two completely different ways: LTX-2.3 locally on our own GPU, and Wan 2.6 in the cloud via Civitai’s generation API. Same source images, same character, production volume on both. Here’s the honest comparison nobody can give you from a benchmark table.

The head-to-head

| | LTX-2.3 (local) | Wan 2.6 (cloud) |
|---|---|---|
| Hardware | RTX 5090, ~30 GB VRAM | none (runs on Civitai/fal) |
| Max clip length | 24 s single pass (longer possible) | 5 / 10 / 15 s fixed |
| Speed (our real numbers) | ~50–60 min for 24 s @ 704×1216 | ~8 min for 15 s @ 720p |
| Marginal cost | electricity | 1,000 Buzz per clip ($1) |
| Motion quality | good, occasionally stiff | noticeably smoother, more natural |
| Identity fidelity | excellent (it animates your image) | excellent (same reason) |
| Length flexibility | king: long coherent single passes | capped, and the cap is real |
| Privacy | total | your image goes to a cloud |
| Fails | dependency hell, VRAM management | opaque rejections, moderation filters |
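
To put the speed row on a common scale, the figures can be converted to render minutes per second of finished footage. The snippet below is only a back-of-envelope sketch using the numbers from the table; the function name is made up for illustration, and the two engines render at different resolutions, so treat the ratio as indicative rather than a like-for-like benchmark.

```python
def minutes_per_footage_second(render_minutes: float, clip_seconds: float) -> float:
    """Wall-clock render minutes needed per second of finished video."""
    return render_minutes / clip_seconds


# LTX-2.3 local: ~50-60 min for a 24 s clip (figures from the table)
ltx_low = minutes_per_footage_second(50, 24)
ltx_high = minutes_per_footage_second(60, 24)

# Wan 2.6 cloud: ~8 min for a 15 s clip (figure from the table)
wan = minutes_per_footage_second(8, 15)

print(f"LTX local: {ltx_low:.2f} to {ltx_high:.2f} min per second of footage")
print(f"Wan cloud: {wan:.2f} min per second of footage")
print(f"Local is about {ltx_low / wan:.1f}x to {ltx_high / wan:.1f}x slower per second")
```

Per second of footage the cloud path is several times faster, which is worth keeping in mind when reading the length and cost rows.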

What we learned in production

Wan’s motion is genuinely better — in bursts. Natural weight shifts, believable hand movement, expressive reactions. For short emotive clips (a reaction, a wave, typing at a keyboard) it beats our local renders visibly.

LTX’s length is the moat. A 24-second coherent single-pass render — no seams, no identity drift — is something the cloud tier simply won’t sell you. Long presenter takes, the backbone of our daily videos, are local-only territory.

The costs converge in a surprising place. Cloud looks cheap per clip until you’re iterating: four rejected takes at ~$1 each versus a local re-render that costs you nothing but time. Local looks free until you price the hour of GPU time. Our rule: iterate local, burst cloud.
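
The iteration cost can be sketched in a few lines. The cloud price of about $1 per clip comes from the table; the local `gpu_hourly_rate` is a hypothetical input (electricity plus whatever you charge yourself for the hardware), not a number from our setup, and the function names are illustrative only.

```python
def cloud_cost(attempts: int, price_per_clip: float = 1.0) -> float:
    """Total spend when every attempt, accepted or rejected, is billed."""
    return attempts * price_per_clip


def local_cost(attempts: int, render_minutes: float, gpu_hourly_rate: float) -> float:
    """Total spend when each attempt occupies the GPU for render_minutes.

    gpu_hourly_rate is an assumption you supply yourself.
    """
    return attempts * (render_minutes / 60.0) * gpu_hourly_rate


def break_even_hourly_rate(render_minutes: float, price_per_clip: float = 1.0) -> float:
    """GPU hourly rate at which one local render costs the same as one cloud clip."""
    return price_per_clip / (render_minutes / 60.0)


# Four rejected takes plus the one that is kept: five billed attempts.
print(f"cloud, 5 attempts: ${cloud_cost(5):.2f}")

# Placeholder rate of $0.25 per GPU hour, 55 min per render.
print(f"local, 5 attempts: ${local_cost(5, 55, 0.25):.2f}")
print(f"break-even rate at 55 min per render: ${break_even_hourly_rate(55):.2f}/h")
```

The per-attempt comparison ignores that a local pass yields 24 s of footage and a cloud clip at most 15 s, so it is a starting point for your own numbers rather than a verdict.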

Cloud failure modes are opaque. We’ve had generations fail with no error message and content flags on innocuous prompts — budget retry time into any cloud workflow. Local failures at least come with a stack trace.
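
One way to budget that retry time is to wrap the submission in a bounded retry loop. This is a generic sketch, not Civitai's API: `submit` stands in for whatever function you use to request a generation, and it is assumed to return a result URL or path on success and `None` (or raise) on failure. The attempt count and delays are placeholder values.

```python
import time
from typing import Callable, Optional


class GenerationFailed(Exception):
    """Raised when every attempt failed or came back empty."""


def generate_with_retries(
    submit: Callable[[], Optional[str]],  # hypothetical: your own submit function
    max_attempts: int = 4,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = submit()
        except Exception as exc:  # opaque failures: remember the error and retry
            last_error = exc
            result = None
        if result:
            return result
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise GenerationFailed(f"no result after {max_attempts} attempts") from last_error
```

Since each cloud attempt may be billed, a low `max_attempts` doubles as a spending cap. A prompt that keeps tripping a moderation filter will not be fixed by retrying, so it is worth logging the prompt alongside each failure.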

What we actually do (the hybrid)

  • Long presenter takes / talking bases: LTX local, 24 s passes
  • Short expressive clips (reactions, typing, gestures): Wan 2.6 cloud
  • Stitching: quick masked transitions (flash/glitch) at the joins — the mixed footage cuts together cleanly because both animate the same source images of the same character
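
The routing rule in the list above can be written down as a small function. This is an illustrative sketch rather than our actual tooling: the `Shot` type, the engine labels, and the choice to round a short clip up to the next fixed Wan length are all assumptions; the 5/10/15 s and 24 s figures come from the comparison table.

```python
from dataclasses import dataclass
from typing import Tuple

WAN_LENGTHS = (5, 10, 15)   # fixed cloud clip lengths in seconds
LTX_PASS_SECONDS = 24       # local single-pass length used here


@dataclass
class Shot:
    kind: str      # "presenter" or "expressive" (hypothetical labels)
    seconds: int   # desired length of the shot


def choose_engine(shot: Shot) -> Tuple[str, int]:
    """Return (engine, render_seconds) for one shot."""
    if shot.kind == "expressive" and shot.seconds <= max(WAN_LENGTHS):
        # Smallest fixed cloud length that covers the shot; trim the rest in the edit.
        render = min(length for length in WAN_LENGTHS if length >= shot.seconds)
        return ("wan-cloud", render)
    # Long takes, and anything over the cloud cap, stay local.
    return ("ltx-local", min(shot.seconds, LTX_PASS_SECONDS))


for shot in (Shot("presenter", 24), Shot("expressive", 4), Shot("expressive", 12)):
    print(shot, "->", choose_engine(shot))
```

A presenter shot longer than 24 s is capped to one pass in this sketch; splitting it into several passes would be the next thing to add.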

One character, two engines, each doing what it’s best at. That’s the actual answer to “which is better” — and it’s why our streamer-cam footage mixes both in a single 90-second sequence.


The character pipeline behind both: one image → rigged 3D · the hardware for the local path: our build