CanViT is the first task-agnostic active-vision foundation model that reconstructs scenes using low-resolution 'glimpses' with 19.5x fewer FLOPs than existing models.
March 25, 2026
Original Paper
CanViT: Toward Active-Vision Foundation Models
arXiv · 2603.22570
The Takeaway
CanViT enables efficient, biologically inspired perception in which the model selectively scans a scene through a sequence of small glimpses rather than processing the entire high-resolution frame at once. For practitioners, this offers a path toward scalable vision models for robotics and edge devices that can handle large environments with minimal compute.
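The compute argument behind glimpse-based vision is easy to see in token counts: a ViT's cost grows with the number of patches it attends over, so a handful of small glimpses produces far fewer tokens than a dense pass over the full frame. The sketch below is purely illustrative — the image sizes, glimpse count, and `num_tokens` helper are assumptions, not figures from the paper.

```python
def num_tokens(image_hw, patch=16):
    """Number of ViT tokens for an image of the given (height, width),
    assuming non-overlapping square patches."""
    h, w = image_hw
    return (h // patch) * (w // patch)

# Dense processing of a hypothetical 1024x1024 frame: 64 * 64 = 4096 tokens.
full_frame = num_tokens((1024, 1024))

# Eight hypothetical 96x96 glimpses: 8 * 6 * 6 = 288 tokens.
glimpses = 8 * num_tokens((96, 96))

print(full_frame, glimpses, round(full_frame / glimpses, 1))
```

The exact savings depend on resolution, glimpse size, and how attention cost scales with token count, so this ratio is not the paper's 19.5x figure — it only shows why sequential, localized glimpses reduce FLOPs at all.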
From the abstract
Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines. As a result, Active-Vision Foundation Models (AVFMs) have remained unexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. …
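"Scene-relative RoPE" suggests encoding each glimpse token at its absolute scene coordinate (glimpse origin plus local offset) rather than at a glimpse-local position, so tokens from different glimpses share one coordinate frame. The paper's exact formulation is not given in the excerpt; below is a minimal sketch of standard 1-D rotary position embeddings with scene positions substituted in, with all names and sizes being assumptions for illustration.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Standard RoPE: rotate feature pairs of x by angles proportional to pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per feature pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation applied pairwise across the feature dimension.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Scene-relative positioning: a token at local offset 3 inside a glimpse whose
# origin is scene coordinate 40 is encoded at position 43, wherever that
# glimpse happens to land.
token = np.ones(8)
glimpse_origin, local_offset = 40.0, 3.0
embedded = rope_rotate(token, glimpse_origin + local_offset)
```

The property that makes this a natural binding mechanism is that RoPE attention scores depend only on position differences: shifting both query and key positions by the same scene offset leaves their dot product unchanged, so content is matched consistently across glimpses.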