Enables training-free, long-horizon (hour-scale) video generation by using evolving memory tokens to prevent identity drift and motion stagnation.
arXiv · March 16, 2026 · 2603.12513
Why it matters
Current autoregressive diffusion models suffer from "amnesia" as their attention window slides forward, causing characters to change appearance or motion to freeze. MemRoPE instead maintains a fixed-size cache that compresses the global history, enabling consistent, hour-long video synthesis without retraining.
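The core idea of a fixed-size cache summarizing an ever-growing history can be sketched as follows. This is a minimal illustration, not MemRoPE's actual compression rule (which the excerpt does not specify): here the hypothetical `compress_to_memory` simply mean-pools contiguous chunks of past keys into a fixed token budget.

```python
import numpy as np

def compress_to_memory(past_keys: np.ndarray, num_mem: int) -> np.ndarray:
    """Hypothetical sketch: fold an unbounded history of past key
    vectors (T, d) into a fixed budget of `num_mem` memory tokens
    by mean-pooling contiguous chunks. The paper's actual update
    rule is not given here; this only shows how a fixed-size cache
    can stand in for the full global history."""
    chunks = np.array_split(past_keys, num_mem, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])  # (num_mem, d)

# As generation proceeds, the history grows but the memory stays fixed-size.
history = np.random.randn(1000, 64)  # 1000 past key vectors, dim 64
memory = compress_to_memory(history, num_mem=16)
assert memory.shape == (16, 64)
```

Attention over `memory` plus a short sliding window then has constant cost per frame, regardless of how long the video becomes.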
From the abstract
Autoregressive diffusion enables real-time frame streaming, yet existing sliding-window caches discard past context, causing fidelity degradation, identity drift, and motion stagnation over long horizons. Current approaches preserve a fixed set of early tokens as attention sinks, but this static anchor cannot reflect the evolving content of a growing video. We introduce MemRoPE, a training-free framework with two co-designed components. Memory Tokens continuously compress all past keys into dual