AI & ML Efficiency Breakthrough

Achieves 16x prefill speedup for video models by using reinforcement learning to dynamically compress visual tokens based on temporal 'surprise'.

March 30, 2026

Original Paper

Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning

Shida Wang, YongXiang Hua, Zhou Tao, Haoyu Cao, Linli Xu

arXiv · 2603.26365

The Takeaway

Context redundancy is the primary bottleneck for long-form video understanding; this framework lets models retain 99.5% of baseline performance while discarding 90% of visual tokens, drastically reducing the compute needed for long-context multimodal tasks.

From the abstract

Multimodal Large Language Models have demonstrated remarkable capabilities in video understanding, yet face prohibitive computational costs and performance degradation from "context rot" due to massive visual token redundancy. Existing compression strategies typically rely on heuristics or fixed transformations that are often decoupled from the downstream task objectives, limiting their adaptability and effectiveness. To address this, we propose SCORE (Surprise-augmented token COmpression via …
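To make the core idea concrete, here is a minimal, self-contained sketch of surprise-based token compression. It is an assumption-laden illustration, not SCORE's method: the paper learns a compression policy with reinforcement learning, while this toy uses a simple hand-crafted proxy for temporal "surprise" (the cosine distance between consecutive mean-pooled frame features) and keeps tokens only from the most surprising frames. All function names and shapes below are invented for illustration.

```python
import numpy as np

def surprise_scores(frame_feats: np.ndarray) -> np.ndarray:
    """Per-frame 'surprise': cosine distance of each frame's mean-pooled
    feature from the previous frame's. A hand-crafted proxy, not the
    learned RL policy described in the paper."""
    pooled = frame_feats.mean(axis=1)                        # (T, D)
    pooled = pooled / (np.linalg.norm(pooled, axis=1, keepdims=True) + 1e-8)
    sims = (pooled[1:] * pooled[:-1]).sum(axis=1)            # (T-1,) cosine sims
    # Give the first frame infinite surprise so it is always kept.
    return np.concatenate([[np.inf], 1.0 - sims])            # (T,)

def compress_tokens(frame_feats: np.ndarray, keep_ratio: float = 0.1):
    """Keep tokens only from the top-scoring frames; keep_ratio=0.1
    mirrors the 90% token reduction quoted in the takeaway."""
    T = frame_feats.shape[0]
    k = max(1, int(round(T * keep_ratio)))
    keep = np.sort(np.argsort(-surprise_scores(frame_feats))[:k])
    return frame_feats[keep], keep

# Toy example: 20 frames, 4 visual tokens per frame, 8-dim features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 4, 8))
compressed, kept = compress_tokens(feats, keep_ratio=0.1)
print(compressed.shape)  # (2, 4, 8): 90% of frames' tokens discarded
```

A learned policy would replace the fixed cosine heuristic with a reward-driven decision of which tokens to drop, which is what couples compression to the downstream task objective.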