Releases an 11-billion-example dataset and model (RealVLG-R1) for unified real-world visual-language grounding and robotic manipulation.
arXiv · March 17, 2026 · 2603.14880
The Takeaway
The scale and multi-granularity annotations (masks, grasp poses, contact points) of this release bridge the gap between high-level language grounding and low-level robotic control, enabling zero-shot manipulation in unseen environments.
From the abstract
Visual-language grounding aims to establish semantic correspondences between natural language and visual entities, enabling models to accurately identify and localize target objects based on textual instructions. Existing VLG approaches focus on coarse-grained, object-level localization, while traditional robotic grasping methods rely predominantly on geometric cues and lack language guidance, which limits their applicability in language-driven manipulation scenarios. To address these limitations …
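To make the annotation granularity concrete, here is a minimal Python sketch of what one multi-granularity sample (mask, grasp pose, contact points) might look like. The field names, shapes, and the GroundingSample/target_centroid helpers are illustrative assumptions for this digest, not the released RealVLG-R1 schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GroundingSample:
    """One hypothetical multi-granularity training example.

    Field names and shapes are guesses based on the annotation
    types named in the release (masks, grasp poses, contact
    points); the actual RealVLG-R1 schema may differ.
    """
    image: np.ndarray           # H x W x 3 RGB frame
    instruction: str            # natural-language referring expression
    mask: np.ndarray            # H x W boolean segmentation of the target
    grasp_pose: np.ndarray      # e.g. 6-DoF pose: (x, y, z, roll, pitch, yaw)
    contact_points: np.ndarray  # N x 3 candidate gripper contact points

def target_centroid(sample: GroundingSample) -> tuple[float, float]:
    """Pixel centroid of the grounded object, computed from its mask."""
    ys, xs = np.nonzero(sample.mask)
    return float(xs.mean()), float(ys.mean())

if __name__ == "__main__":
    h, w = 480, 640
    mask = np.zeros((h, w), dtype=bool)
    mask[200:280, 300:380] = True  # toy rectangular target
    sample = GroundingSample(
        image=np.zeros((h, w, 3), dtype=np.uint8),
        instruction="pick up the red mug on the left",
        mask=mask,
        grasp_pose=np.array([0.42, -0.10, 0.05, 0.0, np.pi, 0.0]),
        contact_points=np.array([[0.41, -0.12, 0.05],
                                 [0.43, -0.08, 0.05]]),
    )
    print(sample.instruction, "->", target_centroid(sample))
```

The point of the sketch is the layering the takeaway describes: the mask carries object-level grounding, while the grasp pose and contact points carry the low-level control signal, all tied to one language instruction.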