Generates 2-minute 480p videos on a single H200 GPU through a hierarchical KV-cache strategy that compresses context by 32x.
March 27, 2026
Original Paper
PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference
arXiv · 2603.25730
The Takeaway
PackForcing addresses the intractable KV-cache growth in long-video generation, enabling 24x temporal extrapolation even when trained only on short clips. Its three-partition cache (sink, mid, and recent tokens) offers a principled way to retain global semantics and local coherence under a strict memory bound.
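To make the partition idea concrete, here is a minimal toy sketch of a fixed-budget cache policy that keeps early "sink" tokens and the most recent tokens verbatim while thinning the middle. All names, parameter values, and the mid-region compression rule (uniform stride subsampling) are illustrative assumptions, not the paper's actual mechanism.

```python
# Toy sketch of a three-partition (sink / mid / recent) KV-cache policy.
# Assumption: the mid region is compressed by simple stride subsampling;
# the paper's real compression scheme may differ.

def partition_cache(n_tokens, n_sink=4, n_recent=16, mid_stride=8):
    """Return the indices of history tokens kept under the policy.

    - sink:   first n_sink tokens, kept verbatim (global semantics)
    - recent: last n_recent tokens, kept verbatim (local coherence)
    - mid:    everything in between, kept every mid_stride-th token
    """
    if n_tokens <= n_sink + n_recent:
        return list(range(n_tokens))  # nothing to compress yet
    sink = list(range(n_sink))
    mid = list(range(n_sink, n_tokens - n_recent, mid_stride))
    recent = list(range(n_tokens - n_recent, n_tokens))
    return sink + mid + recent

kept = partition_cache(1024)
# 4 sink + 126 mid + 16 recent = 146 kept out of 1024 (~7x at these
# toy settings; the paper reports 32x with its actual scheme).
```

The bound matters because the kept-set size grows only with the (strided) mid region rather than linearly with history, which is what makes long-horizon sampling tractable.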
From the abstract
Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve