AI & ML New Capability

Integrates tactile perception into video-action models to enable high-fidelity force modulation in contact-rich robotic tasks.

March 25, 2026

Original Paper

VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, Ismini Lourentzou

arXiv · 2603.23481

The Takeaway

Current Vision-Language-Action (VLA) models fail in tasks where critical state is not visually observable (e.g., handling fragile objects). VTAM adds tactile streams to video-action models, outperforming standard baselines by 80% on high-precision tasks such as picking up potato chips.

From the abstract

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions …
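
To make the core idea concrete, below is a minimal sketch of how a tactile stream could be fused with per-frame video features to predict a short chunk of actions. This is not the paper's architecture: every module, dimension, and the concatenation-plus-Transformer fusion strategy is an assumption chosen for illustration only.

```python
# Illustrative sketch only (assumed design, not VTAM's actual architecture):
# fuse video and tactile token streams with a small Transformer encoder,
# then regress a short chunk of future actions.
import torch
import torch.nn as nn


class VideoTactileActionSketch(nn.Module):
    def __init__(self, video_dim=512, tactile_dim=64, d_model=256,
                 action_dim=7, chunk_len=8):
        super().__init__()
        # Project each modality's per-step features into a shared width.
        self.video_proj = nn.Linear(video_dim, d_model)
        self.tactile_proj = nn.Linear(tactile_dim, d_model)
        # Learned embeddings mark which modality each token came from.
        self.modality_emb = nn.Embedding(2, d_model)
        # A small Transformer encoder attends across the interleaved streams.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Predict a chunk of future actions from the pooled representation.
        self.action_head = nn.Linear(d_model, chunk_len * action_dim)
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, video_feats, tactile_feats):
        # video_feats:   (B, T_v, video_dim)   per-frame visual features
        # tactile_feats: (B, T_t, tactile_dim) per-step tactile readings
        v = self.video_proj(video_feats) + self.modality_emb.weight[0]
        t = self.tactile_proj(tactile_feats) + self.modality_emb.weight[1]
        tokens = torch.cat([v, t], dim=1)   # concatenate the two streams
        fused = self.fusion(tokens)         # cross-modal attention
        pooled = fused.mean(dim=1)          # simple average pooling
        actions = self.action_head(pooled)
        return actions.view(-1, self.chunk_len, self.action_dim)


if __name__ == "__main__":
    model = VideoTactileActionSketch()
    video = torch.randn(2, 16, 512)     # e.g. 16 frames of precomputed features
    tactile = torch.randn(2, 32, 64)    # e.g. 32 tactile sensor readings
    print(model(video, tactile).shape)  # -> torch.Size([2, 8, 7])
```

The point of the sketch is the interface, not the specifics: tactile readings enter the same token sequence as video features, so the fused representation can condition action prediction on contact signals that are invisible to the camera.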