X-World is a controllable, action-conditioned multi-camera world model that simulates realistic future video observations for end-to-end driving.
March 23, 2026
Original Paper
X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving
arXiv · 2603.19979
The Takeaway
X-World bridges pure generative models and robotics simulators: developers can test driving policies in a 'real-world simulator' that respects commanded actions, road geometry, and temporal consistency across multiple camera views.
From the abstract
Scalable and reliable evaluation is increasingly critical in the end-to-end era of autonomous driving, where vision-language-action (VLA) policies directly map raw sensor streams to driving actions. Yet, current evaluation pipelines still rely heavily on real-world road testing, which is costly, biased toward limited scenario coverage, and difficult to reproduce. These challenges motivate a real-world simulator that can generate realistic future observations under proposed actions, while remai