Moves VLM grounding from text-based coordinates to a direct visual token selection mechanism via special pointing tokens.
March 31, 2026
Original Paper
MolmoPoint: Better Pointing for VLMs with Grounding Tokens
arXiv · 2603.28069
The Takeaway
MolmoPoint replaces the inefficient coordinate-string generation used by current VLMs with an intuitive cross-attention pointing mechanism. This achieves higher sample efficiency and sets new state-of-the-art results on PointBench and ScreenSpotPro, suggesting a more scalable way to handle vision-language grounding tasks such as GUI interaction.
From the abstract
Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appr