CanViT is the first task-agnostic active-vision foundation model that reconstructs scenes using low-resolution 'glimpses' with 19.5x fewer FLOPs than existing models.
March 25, 2026
Original Paper
CanViT: Toward Active-Vision Foundation Models
arXiv · 2603.22570
The Takeaway
CanViT enables efficient, biologically inspired perception in which the model selectively scans a scene through a sequence of small glimpses rather than processing the entire high-resolution frame at once. For practitioners, this offers a path toward scalable vision models for robotics and edge devices that can handle large environments with minimal compute.
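The compute argument behind glimpse-based vision is easy to see in token counts: a ViT's cost grows with the number of patches it attends over, so a handful of small glimpses produces far fewer tokens than a dense pass over the full frame. The sketch below is purely illustrative — the image sizes, glimpse count, and `num_tokens` helper are assumptions, not figures from the paper.

```python
def num_tokens(image_hw, patch=16):
    """Number of ViT tokens for an image of the given (height, width),
    assuming non-overlapping square patches."""
    h, w = image_hw
    return (h // patch) * (w // patch)

# Dense processing of a hypothetical 1024x1024 frame: 64 * 64 = 4096 tokens.
full_frame = num_tokens((1024, 1024))

# Eight hypothetical 96x96 glimpses: 8 * 6 * 6 = 288 tokens.
glimpses = 8 * num_tokens((96, 96))

print(full_frame, glimpses, round(full_frame / glimpses, 1))
```

The exact savings depend on resolution, glimpse size, and how attention cost scales with token count, so this ratio is not the paper's 19.5x figure — it only shows why sequential, localized glimpses reduce FLOPs at all.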
From the abstract
Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines. As a result, Active-Vision Foundation Models (AVFMs) have remained unexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. …
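"Scene-relative RoPE" suggests encoding each glimpse token at its absolute scene coordinate (glimpse origin plus local offset) rather than at a glimpse-local position, so tokens from different glimpses share one coordinate frame. The paper's exact formulation is not given in the excerpt; below is a minimal sketch of standard 1-D rotary position embeddings with scene positions substituted in, with all names and sizes being assumptions for illustration.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Standard RoPE: rotate feature pairs of x by angles proportional to pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per feature pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation applied pairwise across the feature dimension.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Scene-relative positioning: a token at local offset 3 inside a glimpse whose
# origin is scene coordinate 40 is encoded at position 43, wherever that
# glimpse happens to land.
token = np.ones(8)
glimpse_origin, local_offset = 40.0, 3.0
embedded = rope_rotate(token, glimpse_origin + local_offset)
```

The property that makes this a natural binding mechanism is that RoPE attention scores depend only on position differences: shifting both query and key positions by the same scene offset leaves their dot product unchanged, so content is matched consistently across glimpses.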