AI & ML Efficiency Breakthrough

Achieves a 50x reduction in visual tokens for Video-LLMs while preserving over 90% of baseline performance.

March 24, 2026

Original Paper

Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention

Junhao Du, Jialong Xue, Anqi Li, Jincheng Dai, Guo Lu

arXiv · 2603.21957

The Takeaway

The paper reformulates token compression as a global spatiotemporal allocation task, allowing models to operate at ultra-low retention ratios (2%). This drastically reduces FLOPs and memory consumption for long-video understanding without requiring model retraining, making high-quality video understanding feasible on consumer hardware.
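To make the idea concrete, here is a minimal sketch of global top-k token selection at a 2% retention ratio. The importance scores and shapes are hypothetical stand-ins (the paper's actual allocation metric is not reproduced here); the point is that ranking happens once over all frames jointly rather than per stage.

```python
import numpy as np

def compress_tokens_global(tokens, scores, retention=0.02):
    """Keep the top `retention` fraction of visual tokens ranked by a
    single global importance score pooled over all frames (a hypothetical
    stand-in for the paper's spatiotemporal allocation metric)."""
    T, N, D = tokens.shape            # frames, tokens per frame, hidden dim
    flat_scores = scores.reshape(-1)  # one global ranking, not per-frame
    k = max(1, int(retention * T * N))
    keep = np.argsort(flat_scores)[-k:]   # indices of the top-k tokens
    return tokens.reshape(T * N, D)[keep], keep

# Toy example: 64 frames x 196 tokens each, keep 2% of them.
rng = np.random.default_rng(0)
toks = rng.standard_normal((64, 196, 16))
scs = rng.random((64, 196))
kept, idx = compress_tokens_global(toks, scs, retention=0.02)
print(kept.shape)  # (250, 16): 2% of 12544 tokens survive
```

Because selection is training-free (pure ranking over precomputed scores), it can be dropped in front of a frozen Video-LLM, which is what makes the "no retraining" claim plausible.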

From the abstract

Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a global spatiotemporal allocation task.
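The "unbalanced allocation" failure mode can be illustrated with a toy numerical comparison. The budget formulas and ratios below are illustrative assumptions, not the paper's method: a two-stage scheme with preset per-stage ratios gives every surviving frame the same token count, while a single global ranking lets the budget concentrate wherever the evidence is.

```python
import numpy as np

def two_stage_budget(T, N, retention, temporal_ratio):
    """Two-stage compression with fixed per-stage ratios (illustrative):
    stage 1 keeps a preset fraction of frames, stage 2 splits the
    remaining token budget uniformly across those frames."""
    frames_kept = max(1, int(temporal_ratio * T))
    tokens_per_frame = max(1, int(retention * T * N / frames_kept))
    return frames_kept, tokens_per_frame

def global_budget(scores, retention):
    """Global allocation: rank all T*N tokens at once; per-frame
    counts fall out of the data instead of a preset split."""
    k = max(1, int(retention * scores.size))
    keep = np.argsort(scores, axis=None)[-k:]
    frame_ids = keep // scores.shape[1]
    return np.bincount(frame_ids, minlength=scores.shape[0])

rng = np.random.default_rng(1)
scores = rng.random((64, 196))
scores[10] += 5.0  # suppose one frame holds most of the visual evidence

print(two_stage_budget(64, 196, retention=0.02, temporal_ratio=0.25))
# (16, 15): the evidence-rich frame gets at most 15 tokens, like any other
print(global_budget(scores, retention=0.02).max())
# 196: global ranking keeps the entire evidence-rich frame
```

At a 2% retention ratio the two-stage split caps the critical frame at the same 15 tokens as every other kept frame, whereas the global ranking preserves all 196 of its tokens; this is the intuition behind reframing compression as one allocation problem.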