Reduces visual tokens by up to 100x using an autoregressive gazing module, enabling 19x faster 4K/1000-frame video understanding.
arXiv · March 13, 2026 · 2603.12254
Why it matters
AutoGaze solves the token explosion problem in high-resolution video understanding by selecting only the most informative visual patches. This allows MLLMs to scale to long-form video (5+ minutes) on consumer-grade hardware while maintaining state-of-the-art accuracy.
From the abstract
Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-s…
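To make the token-reduction idea concrete, here is a minimal sketch of keeping only the most informative visual patches before they reach the model. This is not AutoGaze's learned autoregressive selector; the variance-based saliency score, the `select_informative_patches` helper, and the 1% keep ratio are all illustrative assumptions, chosen only to show how a ~100x reduction in visual tokens could look in code.

```python
import numpy as np

def select_informative_patches(patch_embeddings, keep_ratio=0.01):
    """Keep the top `keep_ratio` fraction of patches by a crude saliency score.

    A hypothetical stand-in for a learned selector: each patch is scored
    independently (here by embedding variance, purely as a placeholder),
    and only the highest-scoring patches are forwarded to the ViT/LLM.
    """
    scores = patch_embeddings.var(axis=1)        # placeholder saliency proxy
    k = max(1, int(len(scores) * keep_ratio))    # number of patches to keep
    keep_idx = np.sort(np.argsort(scores)[-k:])  # top-k, in original order
    return keep_idx, patch_embeddings[keep_idx]

rng = np.random.default_rng(0)
patches = rng.normal(size=(10_000, 64))          # 10k visual patches, dim 64
idx, kept = select_informative_patches(patches, keep_ratio=0.01)
print(len(idx))  # 100 patches survive: a 100x token reduction
```

The real module differs in a key way: it selects patches autoregressively (each choice conditions on previous ones) and is trained with next-token prediction and RL, rather than scoring patches independently as this sketch does.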