AI & ML New Capability

Proposes the Vision-Sound-Language-Action (VSLA) paradigm, enabling robots to respond to real-time environmental acoustics during task execution.

arXiv · March 18, 2026 · 2603.16086

Chang Nie, Tianchen Deng, Guangming Wang, Zhe Liu, Hesheng Wang

The Takeaway

Existing VLA models are effectively 'deaf' during action-execution intervals; this framework uses a streaming 'Historizer' and an audio world model to let robots verify task states through sound (e.g., a click or a spill) that vision alone might miss.
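The core idea, checking a streaming audio signal at every step of an action chunk rather than only at chunk boundaries, can be sketched in toy Python. All names below (`AudioEventDetector`, `execute_chunk`, the string event labels) are illustrative assumptions, not the paper's actual architecture or API:

```python
# Hypothetical sketch of closed-loop audio verification during an action chunk.
# The class and function names are illustrative, not the paper's implementation.
from collections import deque


class AudioEventDetector:
    """Toy stand-in for a streaming audio model: buffers recent frames
    and reports whether a target event (e.g. a 'click') was heard."""

    def __init__(self, window=8):
        self.buffer = deque(maxlen=window)

    def push(self, frame_label):
        self.buffer.append(frame_label)

    def heard(self, event):
        return event in self.buffer


def execute_chunk(actions, audio_stream, detector, expected_event):
    """Execute an action chunk while polling audio at every step,
    so a fleeting sound is not missed between chunk boundaries."""
    for action, frame in zip(actions, audio_stream):
        # ... send `action` to the robot controller here ...
        detector.push(frame)  # ingest the audio frame heard during this step
        if detector.heard(expected_event):
            return True  # state verified by sound; stop or replan early
    return False  # event never heard: flag the chunk for re-verification


# Usage: a 'click' arrives mid-chunk and is caught immediately,
# rather than after the whole open-loop chunk has finished.
det = AudioEventDetector()
verified = execute_chunk(
    ["a1", "a2", "a3", "a4"],
    ["silence", "click", "silence", "silence"],
    det,
    "click",
)
print(verified)  # True
```

The contrast this toy makes concrete: an open-loop chunked policy would only check state after all four actions, whereas the per-step audio poll catches the event at step two.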

From the abstract

While recent Vision-Language-Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre-execution prompts or focus exclusively on human speech. This leaves a significant gap in real-time, sound-centric manipulation, where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low-frequency updates or system latency. This problem is exacerbated by action chunking with open-loop […]