AI & ML Paradigm Shift

Argues that standard ML efficiency metrics (FLOPs, throughput) are poorly correlated with actual robot performance in Vision-Language-Action (VLA) models.

March 20, 2026

Original Paper

From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

Zhuofan Li, Hongkun Yang, Zhenyang Chen, Yangxuan Chen, Yingyan, Chaojian Li

arXiv · 2603.19131

The Takeaway

Demonstrates that optimizing for inference speed alone often degrades trajectory smoothness and increases energy consumption on physical hardware. The paper shifts the focus toward "embodied efficiency"—measured by task completion time and motion quality—which is critical for practitioners deploying models on real robotic platforms.

From the abstract

Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of "efficiency" in current VLA research, characterized by parameters, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion…