Introduces a unified evaluation harness for Vision-Language-Action (VLA) models that standardizes disparate protocols and exposes hidden flaws in published SOTA models.
arXiv · March 17, 2026 · 2603.13966
The Takeaway
Robotics researchers can now evaluate models across 13+ benchmarks with a single interface, achieving up to 47x throughput via Docker-based isolation. The audit reveals critical reproducibility issues in current VLA research, such as undocumented normalization stats that drastically change results.
From the abstract
Vision-Language-Action (VLA) models are typically evaluated using per-benchmark scripts maintained independently by each model repository, leading to duplicated code, dependency conflicts, and underspecified protocols. We present vla-eval, an open-source evaluation harness that decouples model inference from benchmark execution through a WebSocket/msgpack protocol with Docker-based environment isolation. Models integrate once by implementing a single predict() method; benchmarks integrate once via
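The single-method integration contract described in the abstract can be sketched as a small Python class. Note this is a hypothetical illustration, not the actual vla-eval API: the class name, the observation keys (`image`, `instruction`), and the `action_dim` parameter are all assumptions.

```python
# Hypothetical sketch of a model implementing the one predict() entry point
# a harness like vla-eval would call; names and shapes are assumptions.
from typing import Any, Dict, List


class RandomVLAPolicy:
    """Toy VLA model: the harness only needs this predict() method."""

    def __init__(self, action_dim: int = 7) -> None:
        self.action_dim = action_dim

    def predict(self, observation: Dict[str, Any]) -> List[float]:
        # A real model would encode observation["image"] and
        # observation["instruction"]; here we return a fixed zero action.
        return [0.0] * self.action_dim


policy = RandomVLAPolicy(action_dim=7)
action = policy.predict({"image": None, "instruction": "pick up the cube"})
```

Because the model process only speaks this interface over a serialized channel, each benchmark can run in its own Docker container with its own dependencies, which is what eliminates the dependency conflicts the abstract mentions.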