AI & ML · Breaks Assumption

A mechanistic study reveals that Vision-Language-Action (VLA) models are dominated by visual pathways and often ignore language when visual context is sufficient.

March 20, 2026

Original Paper

Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models

Bryce Grant, Xijia Zhao, Peng Wang

arXiv · 2603.19233

The Takeaway

The study challenges the assumption that VLAs 'reason' through multimodal integration; instead, they often treat language as a secondary switch and rely on spatially bound motor programs. This insight is critical for researchers trying to improve robot generalization and instruction-following.

From the abstract

Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M–7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recovers …
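To make the activation-injection method in the abstract concrete, here is a minimal sketch using PyTorch forward hooks: capture an activation from a baseline (full-prompt) episode, then overwrite the same layer's output during a null-prompt episode and compare the resulting actions. The `policy` handle, the `vision_proj` layer name, and the input structure are hypothetical illustrations, not details from the paper.

```python
# Minimal sketch of activation injection, assuming a PyTorch VLA policy.
# Model handle, layer name, and inputs are hypothetical.
import torch

def capture_activation(model, layer, inputs):
    """Run the model once and cache the activation produced at `layer`."""
    cache = {}

    def hook(module, args, output):
        cache["act"] = output.detach()

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(**inputs)
    handle.remove()
    return cache["act"]

def inject_activation(model, layer, saved_act, inputs):
    """Run the model again, replacing `layer`'s output with a saved activation."""

    def hook(module, args, output):
        # Returning a value from a forward hook overrides the layer's output,
        # so downstream computation sees the baseline-episode features instead.
        return saved_act

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        actions = model(**inputs)
    handle.remove()
    return actions

# Usage sketch: if injecting baseline visual features into a null-prompt
# episode restores near-baseline actions, the visual pathway is doing the work.
# baseline_act = capture_activation(policy, policy.vision_proj, baseline_inputs)
# recovered = inject_activation(policy, policy.vision_proj, baseline_act, null_prompt_inputs)
```

Forward hooks are a standard way to run this kind of causal intervention without modifying model code, which is why they pair naturally with the linear probes and SAEs the abstract also mentions.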