Achieves real-time, low-latency talking avatar generation at 34ms per frame using a one-step streaming diffusion framework.
March 17, 2026
Original Paper
AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising
arXiv · 2603.14331
The Takeaway
AvatarForcing tackles exposure bias in streaming avatar generation with dual-anchor temporal forcing, and collapses the diffusion process into a single step via two-stage distillation. The result is a production-ready path to high-fidelity, interactive digital humans on consumer-grade GPUs.
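The digest doesn't spell out the two distillation stages, but the core idea of step-collapse distillation is straightforward to sketch: train a one-step student to reproduce what a multi-step teacher produces. The toy below is only an analogue of that idea, not the paper's method; `teacher_multistep`, `student_one_step`, and the closed-form scalar fit are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_multistep(x_noisy, steps=8):
    """Stand-in for a multi-step diffusion teacher (illustrative only)."""
    x = x_noisy
    for _ in range(steps):
        x = x - 0.1 * x  # pretend each iteration strips a bit more noise
    return x

def student_one_step(x_noisy, w):
    """One-call 'student' that must match the teacher's full trajectory."""
    return w * x_noisy

# Distillation in miniature: fit the student so a single call reproduces
# the teacher's multi-step output.
x = rng.standard_normal(1024)
target = teacher_multistep(x)
w = float(np.dot(target, x) / np.dot(x, x))  # closed-form least-squares fit
err = np.max(np.abs(student_one_step(x, w) - target))
print(f"fitted scale={w:.3f}, expected={0.9**8:.3f}, max error={err:.2e}")
```

In the real system the student is a neural denoiser trained (per the takeaway) in two stages, but the payoff is the same: one network call per frame instead of a full sampling loop, which is what makes 34ms-per-frame streaming plausible.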
From the abstract
Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, which causes errors to accumulate and become irreversible over long rollouts. In contrast, full-sequence diffusion transformers mitigate drift but remain computationally prohibitive for real-time long-form synthesis. We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future […]
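The abstract excerpt cuts off before describing the window mechanics, but the title points to the pattern: each step denoises a fixed local-future window in a single pass and commits only the oldest frame, so the remaining frames are re-denoised on later steps with fresher audio context. Below is a minimal sketch of that loop under assumed shapes and a stub `one_step_denoiser`; none of these names or sizes come from the paper.

```python
import numpy as np

# Hypothetical sizes, purely for illustration.
FRAME_DIM = 512   # per-frame latent size (assumed)
WINDOW = 8        # local-future window length (assumed)

def one_step_denoiser(noisy_window, audio_feats, history):
    """Stub for the distilled one-step denoising network. It just blends
    the noise toward a deterministic function of the audio so the loop runs."""
    target = np.tanh(audio_feats)
    return 0.5 * (noisy_window + target)

def stream_frames(audio_stream, n_frames):
    rng = np.random.default_rng(0)
    history = np.zeros((0, FRAME_DIM))               # frames already emitted
    for t in range(n_frames):
        audio_feats = audio_stream[t : t + WINDOW]   # conditioning for the local-future window
        noisy = rng.standard_normal((WINDOW, FRAME_DIM))
        denoised = one_step_denoiser(noisy, audio_feats, history)
        frame = denoised[0]                          # commit only the oldest frame;
        history = np.vstack([history, frame[None]])  # the rest is re-denoised next step
        yield frame

audio = np.random.default_rng(1).standard_normal((120 + WINDOW, FRAME_DIM))
frames = list(stream_frames(audio, 120))
print(len(frames), frames[0].shape)  # -> 120 (512,)
```

If the window slides by one frame per step, every frame plausibly gets several denoising passes before it is committed, each with more recent audio, which is how a streaming one-step model can keep correcting itself without resorting to full-sequence diffusion.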