AI & ML Open Release

Releases an 11-billion-example dataset and model (RealVLG-R1) for unified real-world visual-language grounding and robotic manipulation.

arXiv · March 17, 2026 · 2603.14880

Linfei Li, Lin Zhang, Ying Shen

The Takeaway

The scale and multi-granularity annotations (masks, grasp poses, contact points) of this release bridge the gap between high-level language grounding and low-level robotic control, enabling zero-shot manipulation in unseen environments.

From the abstract

Visual-language grounding (VLG) aims to establish semantic correspondences between natural language and visual entities, enabling models to accurately identify and localize target objects based on textual instructions. Existing VLG approaches focus on coarse-grained, object-level localization, while traditional robotic grasping methods rely predominantly on geometric cues and lack language guidance, which limits their applicability in language-driven manipulation scenarios. To address these limitations…