Argues that standard ML efficiency metrics (FLOPs, throughput) are poorly correlated with actual robot performance in Vision-Language-Action (VLA) models.
March 20, 2026
Original Paper
From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models
arXiv · 2603.19131
The Takeaway
Demonstrates that optimizing for inference speed often degrades trajectory smoothness and increases energy consumption on physical hardware. It shifts the focus toward 'embodied efficiency'—task completion time and motion quality—which is critical for practitioners deploying models on real robotic platforms.
From the abstract
Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of "efficiency" in current VLA research, characterized by parameters, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion