Decouples high-level reasoning from low-level motor control in robotics using a visual prompting interface.
March 24, 2026
Original Paper
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
arXiv · 2603.22003
The Takeaway
Instead of treating Vision-Language-Action (VLA) models as black boxes, this method uses 'visual prompts' (like bounding boxes and crosshairs) as a standardized interface between planning and execution. This improves spatial precision and robustness in out-of-distribution robotic tasks.
From the abstract
Vision-Language-Action (VLA) models typically map visual observations and linguistic instructions directly to robotic control signals. This "black-box" mapping forces a single forward pass to simultaneously handle instruction interpretation, spatial grounding, and low-level control, often leading to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these limitations, we propose VP-VLA, a dual-system framework that decouples high-level reasoning and low-level motor control.
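To make the decoupling concrete, here is a minimal sketch of the visual-prompting idea in Python. The `VisualPrompt`, `render_prompt`, `planner`, and `action_model` names are hypothetical stand-ins, not the paper's actual API; the point is only that the high-level planner's output is rendered into pixels (a bounding box plus a crosshair) before it reaches the low-level policy.

```python
# Illustrative sketch of visual prompting as a planner/controller
# interface. All component names are assumptions for this example.
from dataclasses import dataclass
from PIL import Image, ImageDraw


@dataclass
class VisualPrompt:
    """Spatial cues emitted by the high-level planner."""
    box: tuple[int, int, int, int]  # target bounding box (x0, y0, x1, y1)
    point: tuple[int, int]          # e.g., a suggested grasp point (x, y)


def render_prompt(obs: Image.Image, prompt: VisualPrompt) -> Image.Image:
    """Draw the planner's cues onto the observation so the low-level
    policy conditions on annotated pixels rather than free-form text."""
    annotated = obs.copy()
    draw = ImageDraw.Draw(annotated)
    draw.rectangle(prompt.box, outline="red", width=3)  # bounding box
    x, y = prompt.point
    draw.line([(x - 8, y), (x + 8, y)], fill="lime", width=2)  # crosshair
    draw.line([(x, y - 8), (x, y + 8)], fill="lime", width=2)
    return annotated


def step(obs, instruction, planner, action_model):
    """Hypothetical control loop: the two systems communicate only
    through the annotated image, not a shared latent space."""
    prompt = planner(obs, instruction)               # slow, high-level reasoning
    return action_model(render_prompt(obs, prompt))  # fast, low-level control
```

Because the two systems exchange information only through annotated images, the interface is "standardized" in the sense the takeaway describes: any policy that consumes images can, in principle, consume the prompt, so planner and controller can be varied independently.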