Proposes the Vision-Sound-Language-Action (VSLA) paradigm, enabling robots to respond to real-time environmental acoustics during task execution.
arXiv · March 18, 2026 · 2603.16086
The Takeaway
Existing VLA models are effectively 'deaf' during action-execution intervals; this framework uses a streaming 'Historizer' and an audio world model to let robots verify task states (e.g., a click or a spill) that vision alone might miss.
From the abstract
While recent Vision-Language-Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre-execution prompts or focus exclusively on human speech. This leaves a significant gap in real-time, sound-centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low-frequency updates or system latency. This problem is exacerbated by action chunking with open-loop …
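The paper's implementation isn't reproduced in this excerpt, but the core closed-loop idea can be illustrated in a short sketch: while an action chunk executes open-loop, a streaming audio monitor keeps listening and can confirm an expected sound or abort on an anomalous one mid-chunk. Everything below (`stream_audio_events`, `execute_action_chunk`, the `EXPECTED`/`ANOMALIES` sets) is hypothetical scaffolding, not the authors' API.

```python
import queue
import threading
import time

# Hypothetical acoustic events the policy cares about during a chunk.
EXPECTED = {"click"}            # confirms success (e.g., a latch engaging)
ANOMALIES = {"spill", "crash"}  # should trigger immediate re-planning

def stream_audio_events(event_queue: queue.Queue, stop: threading.Event) -> None:
    """Stand-in for a streaming audio classifier (a 'Historizer'-style buffer).

    A real system would run a low-latency model over overlapping audio
    windows; here we just emit one scripted event for illustration.
    """
    time.sleep(0.15)
    event_queue.put("click")
    stop.wait()

def execute_action_chunk(actions, event_queue: queue.Queue) -> dict:
    """Execute an open-loop action chunk while listening for key sounds.

    Instead of going 'deaf' until the next visual observation, we drain the
    audio event stream between low-level steps and can react mid-chunk.
    """
    confirmed = set()
    for i, action in enumerate(actions):
        time.sleep(0.05)  # stand-in for sending `action` to the robot
        while True:
            try:
                event = event_queue.get_nowait()
            except queue.Empty:
                break
            if event in ANOMALIES:
                # Abort the remaining open-loop steps and hand back control.
                return {"status": "abort", "step": i, "event": event}
            if event in EXPECTED:
                confirmed.add(event)
    status = "ok" if EXPECTED <= confirmed else "unverified"
    return {"status": status, "confirmed": sorted(confirmed)}

if __name__ == "__main__":
    events: queue.Queue = queue.Queue()
    stop = threading.Event()
    listener = threading.Thread(target=stream_audio_events, args=(events, stop))
    listener.start()
    print(execute_action_chunk([f"a{i}" for i in range(10)], events))
    stop.set()
    listener.join()
```

The point of the sketch is the control-flow change the abstract argues for: acoustic verification happens inside the chunk loop rather than only at the next policy update, so a fleeting sound missed by low-frequency observations can still confirm or veto the current actions.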