AI & ML Efficiency Breakthrough

Generates 2-minute 480p videos on a single H200 GPU through a hierarchical KV-cache strategy that compresses context by 32x.

March 27, 2026

Original Paper

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang

arXiv · 2603.25730

The Takeaway

PackForcing addresses the linear KV-cache growth that makes long-video generation intractable, enabling 24x temporal extrapolation from a model trained only on short clips. Its three-partition cache (sink, mid, and recent tokens) is a principled way to maintain global semantics and local coherence under a strict memory bound.
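To make the three-partition idea concrete, here is a minimal sketch of how a token history might be split into sink, mid, and recent segments, with only the mid segment compressed. It is an illustration under stated assumptions, not the paper's implementation: the function names, the partition sizes, and the use of average pooling as the compressor are all hypothetical, and plain floats stand in for key/value vectors.

```python
from typing import List, Sequence, Tuple


def partition_history(
    tokens: Sequence[float], n_sink: int, n_recent: int
) -> Tuple[List[float], List[float], List[float]]:
    """Split a history into (sink, mid, recent) segments.

    Sink = the earliest n_sink tokens, recent = the latest n_recent tokens,
    mid = everything in between. The sizes are hypothetical parameters.
    """
    if n_sink < 0 or n_recent < 0:
        raise ValueError("partition sizes must be non-negative")
    sink = list(tokens[:n_sink])
    if len(tokens) <= n_sink + n_recent:
        # History is still short: nothing falls into the mid segment.
        return sink, [], list(tokens[n_sink:])
    split = len(tokens) - n_recent
    return sink, list(tokens[n_sink:split]), list(tokens[split:])


def compress_mid(mid: Sequence[float], ratio: int) -> List[float]:
    """Average-pool the mid segment in groups of `ratio` tokens.

    ASSUMPTION: average pooling is only a stand-in compressor chosen for
    this sketch; it is not taken from the paper.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    pooled = []
    for start in range(0, len(mid), ratio):
        group = mid[start:start + ratio]
        pooled.append(sum(group) / len(group))
    return pooled


def build_context(
    tokens: Sequence[float], n_sink: int, n_recent: int, ratio: int
) -> List[float]:
    """Assemble the bounded context: sink + compressed mid + recent."""
    sink, mid, recent = partition_history(tokens, n_sink, n_recent)
    return sink + compress_mid(mid, ratio) + recent


if __name__ == "__main__":
    history = [float(i) for i in range(100)]
    context = build_context(history, n_sink=4, n_recent=32, ratio=32)
    print(len(history), "->", len(context))
```

In this toy setup the sink and recent segments are kept at full resolution while only the mid segment shrinks, so the context grows far more slowly than the raw history. The 32-token pooling group is chosen to echo the 32x compression figure quoted above, but how the paper actually achieves that ratio is not described in this summary.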

From the abstract

Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve