Automates the entire robot training pipeline by using video generation models as motion priors to synthesize both simulation environments and expert trajectories.
arXiv · March 20, 2026 · 2603.18811
The Takeaway
This framework eliminates the need for manual asset curation and heuristic-based motion planning in robotics. By leveraging the rich priors of video models, it generates executable expert data from natural language, facilitating zero-shot sim-to-real transfer for novel objects.
From the abstract
Training generalist robots demands large-scale, diverse manipulation data, yet real-world collection is prohibitively expensive, and existing simulators are often constrained by fixed asset libraries and manual heuristics. To bridge this gap, we present V-Dreamer, a fully automated framework that generates open-vocabulary, simulation-ready manipulation environments and executable expert trajectories directly from natural language instructions. V-Dreamer employs a novel generative pipeline that c…
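To make the abstract's text-to-environment-and-trajectory flow concrete, here is a minimal sketch of what such a pipeline interface could look like. Every name and stage below (`generate_video_prior`, `reconstruct_environment`, `extract_trajectory`, the dataclasses) is an illustrative assumption, not V-Dreamer's actual API; the model calls are replaced with placeholder stubs.

```python
"""Hypothetical sketch of the pipeline described in the abstract:
natural-language instruction -> video prior -> simulation environment
plus an executable expert trajectory. All names are assumptions."""
from dataclasses import dataclass


@dataclass
class SimEnvironment:
    """Simulation-ready scene: object names mapped to (x, y, z) positions."""
    objects: dict


@dataclass
class ExpertTrajectory:
    """Sequence of end-effector waypoints lifted from the generated video."""
    waypoints: list


def generate_video_prior(instruction: str) -> list:
    # Stand-in for a video generation model: emits placeholder "frames".
    return [f"frame_{t}:{instruction}" for t in range(4)]


def reconstruct_environment(frames: list) -> SimEnvironment:
    # Stand-in for reconstructing a scene from the first generated frame.
    return SimEnvironment(objects={"mug": (0.3, 0.0, 0.05)})


def extract_trajectory(frames: list) -> ExpertTrajectory:
    # Stand-in for converting per-frame motion into executable waypoints,
    # one waypoint per generated frame.
    pts = [(0.3, 0.0, 0.05 + 0.02 * t) for t in range(len(frames))]
    return ExpertTrajectory(waypoints=pts)


def v_dreamer(instruction: str):
    """End-to-end: text instruction -> (environment, expert trajectory)."""
    frames = generate_video_prior(instruction)
    return reconstruct_environment(frames), extract_trajectory(frames)


env, traj = v_dreamer("pick up the mug")
print(len(traj.waypoints))
```

The point of the sketch is the data flow, not the internals: a single language instruction drives both scene synthesis and demonstration generation, which is what removes the dependence on fixed asset libraries and hand-written motion heuristics.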