AI & ML Open Release

Introduces FineViT and a 450M local caption dataset to solve the 'coarse perception' bottleneck in current CLIP-based encoders.

March 19, 2026

Original Paper

FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions

Peisen Zhao, Xiaopeng Zhang, Mingxing Xu, Ruoyu Sun, Zewei Du, Dunzheng Wang, Guanghao Zheng, Haohang Xu, Zhibo Zhang, Yuhang Zhang, Yi Ai, Lin Liu, Qi Tian

arXiv · 2603.17326

The Takeaway

By releasing a massive dataset of dense recaptions together with a stronger vision encoder, the work provides a new foundational component for MLLMs that require fine-grained spatial understanding (e.g., OCR, document parsing, complex scene analysis).

From the abstract

While Multimodal Large Language Models (MLLMs) have experienced rapid advancements, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks due to the loss of visual detail caused by low-resolution pretraining and the reliance on noisy, coarse web-crawled image-text pairs. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. […]
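To get a rough feel for the resolution bottleneck the abstract describes: a ViT encoder splits the image into fixed-size patches, so the number of visual tokens grows with input resolution, and at typical CLIP pretraining resolutions small details (e.g., document text) can fall below a single patch. The sketch below uses an illustrative 14-pixel patch size and example resolutions — these are not FineViT's actual configuration.

```python
def num_patches(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping patches a ViT extracts from a square image."""
    side = image_size // patch_size
    return side * side

# A common CLIP pretraining resolution vs. a document-scale input.
print(num_patches(224))   # 16 x 16 = 256 visual tokens
print(num_patches(1008))  # 72 x 72 = 5184 visual tokens
```

At 224×224 each patch covers a 14×14-pixel region, which is coarser than most printed text; pushing resolution up recovers detail but multiplies the token count quadratically, which is the trade-off that fine-grained encoders must manage.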