Generative 3D world models are used to scale Sim-to-Real reinforcement learning for robot Vision-Language-Action (VLA) models.
arXiv · March 20, 2026 · 2603.18532
The Takeaway
The approach tackles data scarcity in robotics by using generative models to create 'digital twins' for training, letting VLAs scale without being restricted to specific physical environments. The authors report a 1.25x speedup in task completion and substantially higher real-world success rates (75% vs. 21%).
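The core loop — generate a fresh simulated scene, roll out the policy in it, update from the reward — can be sketched as a toy example. Everything below is an illustrative assumption, not the paper's pipeline: `generate_scene`, `rollout`, and `finetune` are hypothetical names, a 1-D reaching task stands in for a generative 3D world model, and a scalar Gaussian policy with a REINFORCE-style update stands in for a VLA.

```python
import random

def generate_scene(seed):
    """Stand-in for a generative world model: emits a randomized 'digital
    twin' (here just an object position), so scene diversity scales with
    seeds rather than with physical setups."""
    rng = random.Random(seed)
    return {"object_pos": rng.uniform(-1.0, 1.0)}

def rollout(mu, scene, rng, sigma=0.1):
    """One simulated episode: sample an action from the Gaussian policy
    N(mu, sigma^2) and reward it for landing near the object."""
    action = rng.gauss(mu, sigma)
    reward = -abs(action - scene["object_pos"])
    return action, reward

def finetune(mu=1.5, n_scenes=500, lr=0.01, sigma=0.1):
    """REINFORCE-style fine-tuning across many generated scenes: actions
    that beat a running reward baseline pull the policy mean toward them."""
    rng = random.Random(0)
    baseline, rewards = None, []
    for seed in range(n_scenes):
        scene = generate_scene(seed)  # fresh digital twin per episode
        action, reward = rollout(mu, scene, rng, sigma)
        if baseline is None:
            baseline = reward
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)  # slowly tracking baseline
        # d/d_mu log N(action; mu, sigma^2) = (action - mu) / sigma**2
        mu += lr * advantage * (action - mu) / sigma**2
        rewards.append(reward)
    return mu, rewards
```

The point of the sketch is structural: because scenes come from a generator rather than a lab, the inner loop can draw an effectively unbounded stream of environments, which is the scaling argument the summary makes.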
From the abstract
The strong performance of large vision-language models (VLMs) trained with reinforcement learning (RL) has motivated similar approaches for fine-tuning vision-language-action (VLA) models in robotics. Many recent works fine-tune VLAs directly in the real world to avoid addressing the sim-to-real gap. While real-world RL circumvents sim-to-real issues, it inherently limits the generality of the resulting VLA, as scaling scene and object diversity in the physical world is prohibitively difficult.