Moves VLM grounding from text-based coordinates to a direct visual token selection mechanism via special pointing tokens.
March 31, 2026
Original Paper
MolmoPoint: Better Pointing for VLMs with Grounding Tokens
arXiv · 2603.28069
The Takeaway
MolmoPoint replaces the inefficient coordinate-string generation used by current VLMs with an intuitive cross-attention pointing mechanism. This achieves higher sample efficiency and sets new state-of-the-art results on PointBench and ScreenSpotPro, suggesting a more scalable way to handle vision-language grounding tasks such as GUI interaction.
From the abstract
Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appr