Releases a 117k-instruction dataset and a language-conditioned world model framework for visual navigation.
March 31, 2026
Original Paper
Language-Conditioned World Modeling for Visual Navigation
arXiv · 2603.26741
The Takeaway
The release lowers the barrier to training agents that can 'imagine' future states from natural language instructions. Pairing a diffusion-based world model with an actor-critic agent that operates in latent space establishes a new baseline for language-grounded embodied AI.
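To make the architecture concrete, here is a minimal PyTorch sketch of the two pieces the takeaway names: a language-conditioned world model that rolls out future latent states, and an actor-critic that reads only latents plus an instruction embedding. Everything here is an illustrative assumption — the module names, the dimensions, and the plain MLP standing in for the paper's diffusion-based dynamics — not the authors' implementation.

```python
# Hypothetical sketch of a language-conditioned world model + latent-space
# actor-critic. An MLP stands in for the diffusion denoiser for brevity.
import torch
import torch.nn as nn

class LanguageConditionedWorldModel(nn.Module):
    """Encodes egocentric frames and predicts the next latent state
    from (latent, action, instruction embedding)."""
    def __init__(self, latent_dim=256, action_dim=2, text_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(          # egocentric image -> latent
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(latent_dim),
        )
        self.dynamics = nn.Sequential(          # stand-in for the diffusion model
            nn.Linear(latent_dim + action_dim + text_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def imagine(self, z, actions, text_emb):
        """Open-loop rollout: unroll H steps from one initial latent,
        never re-observing the environment."""
        traj = []
        for a in actions.unbind(dim=1):         # actions: (B, H, action_dim)
            z = self.dynamics(torch.cat([z, a, text_emb], dim=-1))
            traj.append(z)
        return torch.stack(traj, dim=1)         # (B, H, latent_dim)

class ActorCritic(nn.Module):
    """Policy and value heads that see only latents + the instruction."""
    def __init__(self, latent_dim=256, action_dim=2, text_dim=512):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(latent_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),   # continuous control
        )
        self.critic = nn.Sequential(
            nn.Linear(latent_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, z, text_emb):
        x = torch.cat([z, text_emb], dim=-1)
        return self.actor(x), self.critic(x)
```

Training the policy on imagined latent rollouts, rather than raw pixels, is what keeps the actor-critic cheap: the expensive visual model runs once per real frame, while planning happens in the compact latent space.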
From the abstract
We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,01…
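As a hypothetical usage of the sketch above, matching the abstract's setup — one initial egocentric frame, one instruction embedding, no goal image — the agent encodes the frame once and then predicts an action trajectory entirely open-loop. The tensor shapes and horizon are illustrative assumptions.

```python
# Open-loop trajectory prediction from a single observation + instruction.
wm = LanguageConditionedWorldModel()
ac = ActorCritic()

obs0 = torch.randn(1, 3, 64, 64)    # initial egocentric observation
text = torch.randn(1, 512)          # pooled instruction embedding

z = wm.encoder(obs0)                # only contact with real pixels
actions = []
for _ in range(10):                 # no new observations during the rollout
    a, _value = ac(z, text)
    z = wm.dynamics(torch.cat([z, a, text], dim=-1))
    actions.append(a)

plan = torch.stack(actions, dim=1)  # (1, 10, action_dim) predicted trajectory
```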