Shifts world model evaluation from visual fidelity to 'Simulative Reasoning,' revealing a massive gap in current AI's ability to plan.
March 30, 2026
Original Paper
World Reasoning Arena
arXiv · 2603.25887
The Takeaway
The paper points out that a model can generate a high-fidelity video of a ball falling yet still fail to reason about what happens if a hand catches it. Its benchmark reorients world model research toward purposeful action and counterfactual reasoning.
From the abstract
World models (WMs) are intended to serve as internal simulators of the real world that enable agents to understand, anticipate, and act upon complex environments. Existing WM benchmarks remain narrowly focused on next-state prediction and visual fidelity, overlooking the richer simulation capabilities required for intelligent behavior. To address this gap, we introduce WR-Arena, a comprehensive benchmark for evaluating WMs along three fundamental dimensions of next world simulation: (i) Action S