AI & ML Open Release

Releases a 117k-instruction dataset and a language-conditioned world model framework for visual navigation.

March 31, 2026

Original Paper

Language-Conditioned World Modeling for Visual Navigation

Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, Qiyu Hu, Tianyu Wang, Johnalbert Garnica, Feng Liu, Siyu Huang, Qi Dai, Zhi-Qi Cheng

arXiv · 2603.26741

The Takeaway

This release democratizes the training of agents that can 'imagine' future states from verbal instructions. Pairing a diffusion-based world model with an actor-critic agent that operates in latent space provides a new baseline for language-grounded embodied AI.
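To make that pipeline concrete, here is a minimal PyTorch sketch of imagination-based training in latent space. Everything in it is an illustrative assumption rather than the authors' code: the module names (`LatentWorldModel`, `Actor`, `Critic`), the dimensions, and especially the single-pass transition network standing in for an iterative diffusion sampler.

```python
# Minimal sketch (assumed, not the paper's code) of an actor-critic agent
# trained in the latent space of a language-conditioned world model.
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM, TEXT_DIM = 256, 2, 512  # illustrative sizes

class LatentWorldModel(nn.Module):
    """Predicts the next latent state from (latent, action, instruction).
    A diffusion-based world model would iteratively denoise the next
    latent; this single forward pass is a stand-in for brevity."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + ACTION_DIM + TEXT_DIM, 512),
            nn.SiLU(),
            nn.Linear(512, LATENT_DIM),
        )

    def forward(self, z, a, text):
        return self.net(torch.cat([z, a, text], dim=-1))

class Actor(nn.Module):
    """Continuous-control policy conditioned on latent state + language."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + TEXT_DIM, 256),
            nn.SiLU(),
            nn.Linear(256, ACTION_DIM),
            nn.Tanh(),  # bounded continuous actions
        )

    def forward(self, z, text):
        return self.net(torch.cat([z, text], dim=-1))

class Critic(nn.Module):
    """Scores imagined latent states so the actor can be improved
    without ever rendering pixels."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + TEXT_DIM, 256),
            nn.SiLU(),
            nn.Linear(256, 1),
        )

    def forward(self, z, text):
        return self.net(torch.cat([z, text], dim=-1))

def imagine_rollout(world, actor, z0, text, horizon=8):
    """'Imagination': roll the policy forward entirely inside the world
    model's latent space, never touching the real environment."""
    zs, acts, z = [z0], [], z0
    for _ in range(horizon):
        a = actor(z, text)
        z = world(z, a, text)
        zs.append(z)
        acts.append(a)
    return torch.stack(zs), torch.stack(acts)

world, actor, critic = LatentWorldModel(), Actor(), Critic()
z0 = torch.randn(1, LATENT_DIM)    # encoded initial egocentric frame (assumed)
text = torch.randn(1, TEXT_DIM)    # instruction embedding (assumed)
zs, acts = imagine_rollout(world, actor, z0, text)
values = critic(zs, text.expand(zs.shape[0], -1, -1))
print(zs.shape, acts.shape, values.shape)  # imagined states, actions, values
```

Training the actor against critic values computed over imagined rollouts is what lets language shape control without goal images: the instruction conditions both the transition model and the policy.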

From the abstract

We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,01 […]
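Read as a task specification, the abstract's "open-loop trajectory prediction" means the model sees one egocentric frame plus an instruction and must commit to the entire trajectory with no further observations. Below is a hedged sketch of that interface; the example container, dummy model, and L2 metric are illustrative assumptions, not the benchmark's actual API.

```python
# Assumed interface for the open-loop LCVN setting; not the paper's code.
from dataclasses import dataclass
import torch

@dataclass
class LCVNExample:
    first_frame: torch.Tensor   # (3, H, W) initial egocentric RGB observation
    instruction: str            # natural-language instruction
    trajectory: torch.Tensor    # (T, action_dim) ground-truth continuous actions

class DummyPredictor(torch.nn.Module):
    """Placeholder: a real model would ground the instruction in the frame."""
    def forward(self, frame, instruction, horizon, action_dim=2):
        return torch.zeros(horizon, action_dim)

def predict_open_loop(model, ex, horizon):
    """Open-loop: the model is queried once, with no further observations."""
    return model(ex.first_frame.unsqueeze(0), ex.instruction, horizon)

def trajectory_error(pred, gt):
    """Mean L2 distance between predicted and ground-truth actions
    (an assumed metric, for illustration only)."""
    t = min(pred.shape[0], gt.shape[0])
    return torch.linalg.norm(pred[:t] - gt[:t], dim=-1).mean()

ex = LCVNExample(
    first_frame=torch.zeros(3, 224, 224),
    instruction="go past the sofa and stop at the doorway",
    trajectory=torch.zeros(16, 2),
)
pred = predict_open_loop(DummyPredictor(), ex, horizon=16)
print(trajectory_error(pred, ex.trajectory))  # tensor(0.)
```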