AI & ML · VLA Probing Breaks the Chain-of-Thought Assumption

Probing of Vision-Language-Action (VLA) models reveals that the action decoder largely ignores the reasoning logic in Chain-of-Thought, relying almost exclusively on object names.

arXiv · March 16, 2026 · 2603.12717

Tuan Duong Trinh, Naveed Akhtar, Basim Azam

Why it matters

This challenges the assumption that 'thinking' in VLAs improves physical task execution: corrupting the plan's logic or step order has little effect on performance, while swapping entity names causes task failure. This suggests that current CoT implementations in robotics may be redundant facades, and that future models need tighter coupling between reasoning and motor commands.
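
The paper's exact perturbation suite is not reproduced here, but a minimal sketch of what such targeted corruptions could look like follows. The function names, token swaps, and regex approach are illustrative assumptions, not the authors' implementation:

```python
import random
import re

def shuffle_steps(plan: str) -> str:
    """Sequence corruption: randomly permute the order of the plan's steps."""
    steps = [s for s in plan.splitlines() if s.strip()]
    random.shuffle(steps)
    return "\n".join(steps)

def corrupt_logic(plan: str) -> str:
    """Logic corruption: invert common relational and causal tokens
    in a single pass, so swapped tokens are not re-swapped."""
    swaps = {"before": "after", "left": "right", "into": "out of",
             "open": "close", "pick up": "put down"}
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in swaps) + r")\b")
    return pattern.sub(lambda m: swaps[m.group(1)], plan)

def swap_entities(plan: str, rename: dict[str, str]) -> str:
    """Entity corruption: replace object names, e.g. {'red mug': 'blue block'}."""
    for old, new in rename.items():
        plan = re.sub(rf"\b{re.escape(old)}\b", new, plan)
    return plan
```

Under the paper's finding, only something like `swap_entities` should move the needle; the other two corruptions would leave task performance largely unchanged.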

From the abstract

Recent Vision-Language-Action (VLA) models increasingly adopt chain-of-thought (CoT) reasoning, generating a natural-language plan before decoding motor commands. This internal text channel between the reasoning module and the action decoder has received no adversarial scrutiny. We ask: which properties of this intermediate plan does the action decoder actually rely on, and can targeted corruption of the reasoning trace alone -- with all inputs left intact -- degrade a robot's physical task performance?
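
To make the setup the abstract describes concrete, here is a small hypothetical harness: all inputs are left intact and only the intermediate plan text is attacked before it reaches the action decoder. The callables and their signatures are assumptions for illustration, not the paper's code:

```python
from typing import Callable, Iterable, Optional

def success_rate(
    generate_plan: Callable[[object, str], str],    # reasoning module: (obs, instruction) -> CoT plan
    decode_and_run: Callable[[object, str], bool],  # action decoder + rollout: (obs, plan) -> task success
    episodes: Iterable[tuple[object, str]],         # (observation, instruction) pairs
    corruption: Optional[Callable[[str], str]] = None,
) -> float:
    """Measure task success while corrupting only the intermediate plan text."""
    results = []
    for obs, instruction in episodes:
        plan = generate_plan(obs, instruction)  # visual and language inputs stay intact
        if corruption is not None:
            plan = corruption(plan)             # targeted attack on the text channel alone
        results.append(decode_and_run(obs, plan))
    return sum(results) / len(results)
```

Comparing `success_rate(..., corruption=shuffle_steps)` against `success_rate(..., corruption=swap_entities)` operationalizes the paper's headline comparison: the gap between the two isolates which properties of the plan the decoder actually reads.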