Achieves hour-scale, real-time human animation by addressing two problems in autoregressive diffusion: unbounded memory growth and inconsistent noise states.
arXiv · March 13, 2026 · 2603.11746
Why it matters
Neighbor Forcing and structured ConvKV memory allow video of unbounded length to be generated with constant memory usage. This removes the hardware-imposed temporal limit that currently restricts video generation practitioners to short clips.
From the abstract
Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learnin