AI & ML · New Capability

Uses Sparse Autoencoders (SAEs) to show that Vision-Language-Action models learn steerable motion primitives rather than merely memorized sequences.

arXiv · March 20, 2026 · 2603.19183

Aiden Swann, Lachlain McGranahan, Hugo Buurmeijer, Monroe Kennedy III, Mac Schwager

The Takeaway

This is the first mechanistic evidence that VLA models learn generalizable features that can be causally steered. Practitioners can draw on these insights to diagnose why robot policies fail, and potentially apply SAE feature steering to correct robot behavior without full fine-tuning.
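To make the steering idea concrete, here is a minimal, hypothetical sketch of nudging a policy's hidden state along one learned SAE feature direction at inference time via a PyTorch forward hook. The layer path, feature index, and scale are illustrative assumptions, not the paper's actual procedure.

```python
import torch

def make_steering_hook(feature_direction: torch.Tensor, scale: float):
    """Forward hook that shifts a layer's output along one SAE feature direction."""
    def hook(module, inputs, output):
        # Transformer blocks often return tuples; steer the hidden-state entry.
        if isinstance(output, tuple):
            return (output[0] + scale * feature_direction,) + output[1:]
        return output + scale * feature_direction
    return hook

# Illustrative usage (names are hypothetical): the decoder column of a chosen
# SAE feature is that feature's direction in activation space.
# direction = sae.decoder.weight[:, 4071].detach()
# handle = vla.language_model.layers[12].register_forward_hook(
#     make_steering_hook(direction, scale=8.0))
# ...run the policy rollout, then handle.remove()
```

The appeal over fine-tuning is that nothing is retrained: a single interpretable direction is added or removed at test time, and the intervention can be toggled per rollout.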

From the abstract

Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, their generalization is inconsistent: while these models can perform impressively in some settings, fine-tuned variants often fail on novel objects, scenes, and instructions. We apply mechanistic interpretability techniques to better understand the inner workings of VLA models. To probe internal representations, we train Sparse Autoencoders (SAEs) on hidden layer activations.
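A minimal sketch of that probing setup, assuming activations have already been cached from a chosen hidden layer of the VLA model; the dictionary size, sparsity penalty, and training hyperparameters below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete SAE: decomposes hidden states into sparse feature codes."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)       # reconstruction of the hidden state
        return x_hat, f

def train_sae(activations: torch.Tensor, d_features=16384,
              l1_coef=1e-3, steps=10_000, lr=1e-4):
    """Fit an SAE on cached activations of shape [N, d_model]."""
    sae = SparseAutoencoder(activations.shape[-1], d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        batch = activations[torch.randint(len(activations), (256,))]
        x_hat, f = sae(batch)
        # Reconstruction loss plus an L1 penalty that encourages sparse codes.
        loss = F.mse_loss(x_hat, batch) + l1_coef * f.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return sae
```

Once trained, individual decoder directions can be inspected for interpretable structure (e.g., candidate motion primitives) and reused for the steering intervention sketched above.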