Introduces a unified evaluation harness for Vision-Language-Action (VLA) models that standardizes disparate protocols and exposes hidden flaws in published SOTA models.
arXiv · March 17, 2026 · 2603.13966
The Takeaway
Robotics researchers can now evaluate models across 13+ benchmarks with a single interface, achieving up to 47x throughput via Docker-based isolation. The audit reveals critical reproducibility issues in current VLA research, such as undocumented normalization stats that drastically change results.
From the abstract
Vision-Language-Action (VLA) models are typically evaluated using per-benchmark scripts maintained independently by each model repository, leading to duplicated code, dependency conflicts, and underspecified protocols. We present vla-eval, an open-source evaluation harness that decouples model inference from benchmark execution through a WebSocket/msgpack protocol with Docker-based environment isolation. Models integrate once by implementing a single predict() method; benchmarks integrate once via
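The single-method integration contract described in the abstract can be sketched as a small Python class. Note this is a hypothetical illustration, not the actual vla-eval API: the class name, the observation keys (`image`, `instruction`), and the `action_dim` parameter are all assumptions.

```python
# Hypothetical sketch of a model implementing the one predict() entry point
# a harness like vla-eval would call; names and shapes are assumptions.
from typing import Any, Dict, List


class RandomVLAPolicy:
    """Toy VLA model: the harness only needs this predict() method."""

    def __init__(self, action_dim: int = 7) -> None:
        self.action_dim = action_dim

    def predict(self, observation: Dict[str, Any]) -> List[float]:
        # A real model would encode observation["image"] and
        # observation["instruction"]; here we return a fixed zero action.
        return [0.0] * self.action_dim


policy = RandomVLAPolicy(action_dim=7)
action = policy.predict({"image": None, "instruction": "pick up the cube"})
```

Because the model process only speaks this interface over a serialized channel, each benchmark can run in its own Docker container with its own dependencies, which is what eliminates the dependency conflicts the abstract mentions.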