AI & ML Efficiency Breakthrough

Replaces visual token compression with sparse, dynamically selected vision-language interactions in VLLMs.

March 25, 2026

Original Paper

VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Yassine Ouali, Georgios Tzimiropoulos

arXiv · 2603.23495

The Takeaway

Most efficiency methods drop visual information, which hurts fine-grained reasoning. VISOR keeps all visual tokens and instead sparsifies the vision-language attention, dynamically selecting which layers interact with the visual tokens. It matches SOTA performance while drastically reducing FLOPs on high-resolution vision-language tasks.
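To make the idea concrete, here is a minimal numpy sketch of the general technique: keep every visual token, but let only a small, dynamically selected subset of layers attend to them, so most layers pay text-only attention cost. The function names (`visor_like_forward`, `layer_forward`), the top-k gate scoring, and the FLOP proxy are illustrative assumptions, not the paper's actual selection mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def layer_forward(text, vision, use_vision):
    # One simplified decoder layer: text tokens attend either to
    # text-only keys (cheap) or to vision + text keys (expensive).
    kv = np.concatenate([vision, text], axis=0) if use_vision else text
    return text + attention(text, kv, kv)

def visor_like_forward(text, vision, gate_scores, budget):
    # Hypothetical sparse selection: only the `budget` layers with the
    # highest gate scores include the visual tokens in attention.
    keep = set(np.argsort(gate_scores)[-budget:])
    flops = 0
    for layer in range(len(gate_scores)):
        use_vision = layer in keep
        kv_len = len(text) + (len(vision) if use_vision else 0)
        flops += len(text) * kv_len  # score-matrix entries as a FLOP proxy
        text = layer_forward(text, vision, use_vision)
    return text, flops

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))      # 8 text tokens, dim 16
vision = rng.normal(size=(64, 16))   # all 64 visual tokens are kept
gates = rng.normal(size=12)          # one gate score per layer

out_sparse, f_sparse = visor_like_forward(text, vision, gates, budget=3)
out_dense, f_dense = visor_like_forward(text, vision, gates, budget=12)
```

With 3 of 12 layers selected, the attention-score cost in this toy setup drops from 12 x 8 x 72 = 6912 entries to 3 x 8 x 72 + 9 x 8 x 8 = 2304, with no visual tokens discarded; the real method's gating and savings will of course differ.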

From the abstract

Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing visual tokens, VISOR sparsifies the vision-language interactions, dynamically selecting where they occur.