Introduces explicit spatial tokens (segmentation/depth) into the autoregressive sequence of LVLMs to enable precise 3D/2D grounding.
arXiv · March 20, 2026 · 2603.18795
The Takeaway
By generating dense depth and semantic segmentation tokens as a 'spatial chain-of-thought' before answering, the model achieves state-of-the-art spatial reasoning. This bridges the gap between semantic understanding and geometric precision in vision-language models.
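The "spatial chain-of-thought" idea can be pictured as an ordering constraint on the autoregressive sequence: spatial tokens come before the textual answer. A minimal sketch, where the marker names (`<depth>`, `<seg>`) and token strings are assumptions for illustration, not the paper's actual vocabulary:

```python
def build_sequence(image_tokens, depth_tokens, seg_tokens, answer_tokens):
    """Assemble one training/inference sequence: image context, then
    explicit spatial blocks, then the answer. Marker tokens are hypothetical."""
    return (image_tokens
            + ["<depth>"] + depth_tokens + ["</depth>"]
            + ["<seg>"] + seg_tokens + ["</seg>"]
            + answer_tokens)

seq = build_sequence(["img_0", "img_1"], ["d_42"], ["s_7"],
                     ["The", "mug", "is", "on", "the", "left"])
print(seq)
```

The point of the ordering is that the answer tokens are conditioned on an explicit geometric interpretation the model has already committed to, rather than on geometry it must infer implicitly.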
From the abstract
Large Vision-Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQVAE depth codebook from a strong m…
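The core operation behind a VQ-VAE depth codebook is nearest-neighbor quantization: each continuous depth-patch embedding is snapped to the closest entry of a learned codebook, and its index becomes a discrete token the LVLM can emit. A minimal sketch of the lookup step, with toy dimensions and random data standing in for the paper's actual distilled codebook:

```python
import numpy as np

def quantize_depth_to_tokens(depth_patches, codebook):
    """VQ-VAE-style quantization: map each patch embedding to the id of
    its nearest codebook vector (squared-L2 distance).
    depth_patches: (N, D) array; codebook: (K, D) array; returns (N,) ids."""
    d2 = ((depth_patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

# Toy example: 8-entry codebook of 16-dim codes; patches are slightly
# perturbed copies of codes 2, 5, 2, 7, so they quantize back to those ids.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))
patches = codebook[[2, 5, 2, 7]] + 0.01 * rng.normal(size=(4, 16))
tokens = quantize_depth_to_tokens(patches, codebook)
print(tokens.tolist())  # → [2, 5, 2, 7]
```

In the full pipeline, these integer ids would be added to the LVLM's vocabulary so depth can be generated autoregressively like ordinary text tokens; the encoder, decoder, and distillation loss are beyond this sketch.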