Introduces explicit spatial tokens (segmentation/depth) into the autoregressive sequence of LVLMs to enable precise 3D/2D grounding.
arXiv · March 20, 2026 · 2603.18795
The Takeaway
By generating dense depth and semantic segmentation tokens as a 'spatial chain-of-thought' before answering, the model achieves state-of-the-art spatial reasoning. This bridges the gap between semantic understanding and geometric precision in vision-language models.
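The "spatial chain-of-thought" idea can be pictured as an ordering constraint on the autoregressive sequence: spatial tokens come before the textual answer. A minimal sketch, where the marker names (`<depth>`, `<seg>`) and token strings are assumptions for illustration, not the paper's actual vocabulary:

```python
def build_sequence(image_tokens, depth_tokens, seg_tokens, answer_tokens):
    """Assemble one training/inference sequence: image context, then
    explicit spatial blocks, then the answer. Marker tokens are hypothetical."""
    return (image_tokens
            + ["<depth>"] + depth_tokens + ["</depth>"]
            + ["<seg>"] + seg_tokens + ["</seg>"]
            + answer_tokens)

seq = build_sequence(["img_0", "img_1"], ["d_42"], ["s_7"],
                     ["The", "mug", "is", "on", "the", "left"])
print(seq)
```

The point of the ordering is that the answer tokens are conditioned on an explicit geometric interpretation the model has already committed to, rather than on geometry it must infer implicitly.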
From the abstract
Large Vision-Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQVAE depth codebook from a strong m…
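The core operation behind a VQ-VAE depth codebook is nearest-neighbor quantization: each continuous depth-patch embedding is snapped to the closest entry of a learned codebook, and its index becomes a discrete token the LVLM can emit. A minimal sketch of the lookup step, with toy dimensions and random data standing in for the paper's actual distilled codebook:

```python
import numpy as np

def quantize_depth_to_tokens(depth_patches, codebook):
    """VQ-VAE-style quantization: map each patch embedding to the id of
    its nearest codebook vector (squared-L2 distance).
    depth_patches: (N, D) array; codebook: (K, D) array; returns (N,) ids."""
    d2 = ((depth_patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

# Toy example: 8-entry codebook of 16-dim codes; patches are slightly
# perturbed copies of codes 2, 5, 2, 7, so they quantize back to those ids.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))
patches = codebook[[2, 5, 2, 7]] + 0.01 * rng.normal(size=(4, 16))
tokens = quantize_depth_to_tokens(patches, codebook)
print(tokens.tolist())  # → [2, 5, 2, 7]
```

In the full pipeline, these integer ids would be added to the LVLM's vocabulary so depth can be generated autoregressively like ordinary text tokens; the encoder, decoder, and distillation loss are beyond this sketch.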