We generate AI video of the same character every day, two completely different ways: LTX-2.3 locally on our own GPU, and Wan 2.6 in the cloud via Civitai’s generation API. Same source images, same character, production volume on both. Here’s the honest comparison nobody can give you from a benchmark table.
The head-to-head
| | LTX-2.3 (local) | Wan 2.6 (cloud) |
|---|---|---|
| Hardware | RTX 5090, ~30 GB VRAM | none (runs on Civitai/fal) |
| Max clip length | 24 s single pass (longer possible) | 5 / 10 / 15 s fixed |
| Speed (our real numbers) | ~50–60 min for 24 s @ 704×1216 | ~8 min for 15 s @ 720p |
| Marginal cost | electricity | ~$1 per take |
| Motion quality | good, occasionally stiff | noticeably smoother, more natural |
| Identity fidelity | excellent (it animates your image) | excellent (same reason) |
| Length flexibility | king — long coherent single passes | capped, and the cap is real |
| Privacy | total | your image goes to a cloud |
| Fails | dependency hell, VRAM management | opaque rejections, moderation filters |
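
To put the two speed rows on one scale, here is a small sketch that converts them into wall-clock minutes per second of finished footage. It uses only the numbers from the table; the function name is ours, and since the two engines render at different resolutions (704×1216 vs 720p), treat the ratio as a rough guide rather than a benchmark.

```python
# Wall-clock minutes spent per second of finished footage,
# computed from the speed row of the table above.

def minutes_per_output_second(render_minutes: float, clip_seconds: float) -> float:
    """Minutes of rendering needed for one second of generated video."""
    return render_minutes / clip_seconds

ltx_fast = minutes_per_output_second(50, 24)  # local, best case
ltx_slow = minutes_per_output_second(60, 24)  # local, worst case
wan = minutes_per_output_second(8, 15)        # cloud

print(f"LTX-2.3: {ltx_fast:.2f} to {ltx_slow:.2f} min per output second")
print(f"Wan 2.6: {wan:.2f} min per output second")
print(f"Local is roughly {ltx_fast / wan:.1f}x to {ltx_slow / wan:.1f}x slower per second")
```

On our numbers that works out to local being about 4x to 5x slower per second of output, which is the price of the longer single pass.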
What we learned in production
Wan’s motion is genuinely better — in bursts. Natural weight shifts, believable hand movement, expressive reactions. For short emotive clips (a reaction, a wave, typing at a keyboard) it beats our local renders visibly.
LTX’s length is the moat. A 24-second coherent single-pass render — no seams, no identity drift — is something the cloud tier simply won’t sell you. Long presenter takes, the backbone of our daily videos, are local-only territory.
The costs converge in a surprising place. Cloud looks cheap per clip until you’re iterating: four rejected takes at ~$1 each versus a local re-render that costs you nothing but time. Local looks free until you price the hour of GPU time. Our rule: iterate local, burst cloud.
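A rough sketch of that arithmetic, if you want to plug in your own numbers. The ~$1 per take and the ~60-minute render come from our figures above; the GPU power draw, the electricity price, and the value you put on an hour of GPU time are placeholder assumptions, not measurements.

```python
def cloud_cost(price_per_take: float, rejected_takes: int) -> float:
    """Total spend for one accepted clip, assuming every take is billed."""
    return price_per_take * (rejected_takes + 1)

def local_cost(
    render_minutes: float,
    gpu_kw: float,
    price_per_kwh: float,
    gpu_hour_value: float = 0.0,
) -> float:
    """Energy cost of one render plus whatever an hour of GPU time is worth to you."""
    hours = render_minutes / 60
    return hours * (gpu_kw * price_per_kwh + gpu_hour_value)

# Cloud: four rejected takes plus the keeper, at ~$1 each.
print(f"cloud, 4 rejects: ${cloud_cost(1.00, 4):.2f}")

# Local: one 60-minute render.
# ASSUMPTIONS: 0.575 kW GPU draw and $0.30/kWh are illustrative values only;
# whole-system draw and your local tariff will differ.
print(f"local, energy only: ${local_cost(60, 0.575, 0.30):.2f}")

# Same render, but pricing the GPU hour at a hypothetical $1.00.
print(f"local, GPU hour priced: ${local_cost(60, 0.575, 0.30, gpu_hour_value=1.00):.2f}")
```

The point of the third line is the convergence: once the GPU hour has a price, a single local render lands in the same range as a single cloud take, and the difference comes down to how many retries each path needs.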
Cloud failure modes are opaque. We’ve had generations fail with no error message at all, and we’ve had innocuous prompts tripped by content flags, so budget retry time into any cloud workflow. Local failures at least come with a stack trace.
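One way to budget that retry time is a bounded retry loop that records why each attempt failed. This is a minimal sketch, not our production code: `submit` is a hypothetical stand-in for whatever client call starts a cloud generation (it is not a real Civitai function), and the attempt count and delays are arbitrary defaults.

```python
import time
from typing import Callable, Optional

def generate_with_retries(
    submit: Callable[[], Optional[str]],
    max_attempts: int = 4,
    base_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[Optional[str], list[str]]:
    """Call submit() until it returns a result or the attempts run out.

    submit is hypothetical: it should return a clip path or URL on success,
    return None for a silent failure, or raise on an explicit rejection.
    Returns (result or None, list of failure notes).
    """
    failures: list[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = submit()
        except Exception as exc:  # explicit rejection or transport error
            failures.append(f"attempt {attempt}: {type(exc).__name__}: {exc}")
        else:
            if result is not None:
                return result, failures
            failures.append(f"attempt {attempt}: no result and no error message")
        if attempt < max_attempts:
            # Exponential backoff between attempts: 5 s, 10 s, 20 s by default.
            sleep(base_delay_s * 2 ** (attempt - 1))
    return None, failures
```

Keeping the failure notes matters more than the backoff itself: when the service gives you no error message, your own log of which attempts came back empty is the only trace you get.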
What we actually do (the hybrid)
- Long presenter takes / talking bases: LTX local, 24 s passes
- Short expressive clips (reactions, typing, gestures): Wan 2.6 cloud
- Stitching: quick masked transitions (flash/glitch) at the joins — the mixed footage cuts together cleanly because both animate the same source images of the same character
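
The routing rule in that list can be sketched as a small planner. The shot labels, engine names, and the `Shot` type are ours for illustration; the 5/10/15 s cloud lengths and the 24 s local pass come from the table above.

```python
from dataclasses import dataclass

WAN_DURATIONS_S = (5, 10, 15)  # fixed cloud clip lengths
LTX_MAX_PASS_S = 24            # single-pass length we render locally

@dataclass(frozen=True)
class Shot:
    kind: str       # "presenter" or "expressive" (hypothetical labels)
    seconds: float  # desired length of the shot

def plan_shot(shot: Shot) -> tuple[str, list[float]]:
    """Return (engine, clip lengths in seconds) for one shot."""
    if shot.kind == "expressive" and shot.seconds <= max(WAN_DURATIONS_S):
        # Round up to the nearest length the cloud tier offers.
        length = next(d for d in WAN_DURATIONS_S if d >= shot.seconds)
        return "wan-2.6-cloud", [float(length)]
    # Everything else goes local, split into passes of at most 24 s.
    passes: list[float] = []
    remaining = shot.seconds
    while remaining > 0:
        step = min(LTX_MAX_PASS_S, remaining)
        passes.append(float(step))
        remaining -= step
    return "ltx-2.3-local", passes

print(plan_shot(Shot("expressive", 4)))   # short reaction goes to the cloud
print(plan_shot(Shot("presenter", 60)))   # long take becomes several local passes
```

An expressive shot longer than 15 s falls through to the local path in this sketch, since the cloud cap makes it impossible to render there in one clip.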
One character, two engines, each doing what it’s best at. That’s the actual answer to “which is better” — and it’s why our streamer-cam footage mixes both in a single 90-second sequence.
The character pipeline behind both: one image → rigged 3D · the hardware for the local path: our build