A mechanistic study reveals that Vision-Language-Action (VLA) models are dominated by visual pathways and often ignore language when visual context is sufficient.
March 20, 2026
Original Paper
Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models
arXiv · 2603.19233
The Takeaway
The study challenges the assumption that VLAs 'reason' through multimodal integration; instead, they often treat language as a secondary switch and rely on spatially bound motor programs. This insight matters for researchers working to improve robot generalization and instruction following.
From the abstract
Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M–7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recovers …
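To make the "activation injection" idea concrete, here is a minimal sketch in PyTorch of patching one layer's output with an activation cached from a baseline (full-prompt) run while feeding a null-prompt input. The module paths, layer choice, and tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal activation-injection sketch using a PyTorch forward hook.
# Assumes `cached_activation` was recorded at the same layer during a
# baseline run with the full language prompt.
import torch

def make_injection_hook(cached_activation: torch.Tensor):
    """Build a hook that replaces a layer's output with a cached baseline activation."""
    def hook(module, inputs, output):
        # Returning a tensor from a forward hook overrides the layer's output.
        return cached_activation.to(output.device, output.dtype)
    return hook

@torch.no_grad()
def run_with_injection(model, null_prompt_inputs, target_layer, cached_activation):
    """Run a null-prompt episode while injecting baseline activations at one layer."""
    handle = target_layer.register_forward_hook(make_injection_hook(cached_activation))
    try:
        actions = model(**null_prompt_inputs)  # forward pass with the patched layer
    finally:
        handle.remove()  # always detach the hook, even if the forward pass fails
    return actions
```

If action quality under this intervention approaches the baseline, the language input at that layer contributed little to the generated actions, which is the kind of evidence the abstract describes for visual-pathway dominance.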