Reduces visual tokens by up to 100x using an autoregressive gazing module, enabling 19x faster 4K/1000-frame video understanding.
arXiv · March 13, 2026 · 2603.12254
Why it matters
AutoGaze solves the token explosion problem in high-resolution video understanding by selecting only the most informative visual patches. This allows MLLMs to scale to long-form video (5+ minutes) on consumer-grade hardware while maintaining state-of-the-art accuracy.
From the abstract
Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-s…
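To make the token-reduction idea concrete, here is a minimal sketch of keeping only the most informative visual patches before they reach the model. This is not AutoGaze's learned autoregressive selector; the variance-based saliency score, the `select_informative_patches` helper, and the 1% keep ratio are all illustrative assumptions, chosen only to show how a ~100x reduction in visual tokens could look in code.

```python
import numpy as np

def select_informative_patches(patch_embeddings, keep_ratio=0.01):
    """Keep the top `keep_ratio` fraction of patches by a crude saliency score.

    A hypothetical stand-in for a learned selector: each patch is scored
    independently (here by embedding variance, purely as a placeholder),
    and only the highest-scoring patches are forwarded to the ViT/LLM.
    """
    scores = patch_embeddings.var(axis=1)        # placeholder saliency proxy
    k = max(1, int(len(scores) * keep_ratio))    # number of patches to keep
    keep_idx = np.sort(np.argsort(scores)[-k:])  # top-k, in original order
    return keep_idx, patch_embeddings[keep_idx]

rng = np.random.default_rng(0)
patches = rng.normal(size=(10_000, 64))          # 10k visual patches, dim 64
idx, kept = select_informative_patches(patches, keep_ratio=0.01)
print(len(idx))  # 100 patches survive: a 100x token reduction
```

The real module differs in a key way: it selects patches autoregressively (each choice conditions on previous ones) and is trained with next-token prediction and RL, rather than scoring patches independently as this sketch does.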