AI & ML New Capability

Introduces a paradigm for vision-language navigation that uses ubiquitously available semantic floor plans as global spatial priors.

arXiv · March 19, 2026 · 2603.17437

Kehan Chen, Yan Huang, Dong An, Jiawei He, Yifei Su, Jing Liu, Nianfeng Liu, Liang Wang

The Takeaway

Standard VLN agents rely solely on local visual cues and verbose instructions; this method allows agents to navigate complex buildings using only concise commands and a 2D floor plan. It achieves a 60% relative improvement in success rate by bridging the gap between structured schematics and visual observations.
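To make the idea concrete, here is a minimal, hypothetical sketch of what "using a 2D floor plan as a global spatial prior" can mean: a grid of semantic room labels is searched with BFS to turn a concise room-level command into a path. The grid, the command format, and `plan_path` are illustrative assumptions, not the paper's actual representation or architecture.

```python
from collections import deque

# Toy semantic floor plan: each cell holds a room label ("" = wall).
FLOOR_PLAN = [
    ["hall", "hall", "kitchen"],
    ["hall", "",     "kitchen"],
    ["bed",  "bed",  "kitchen"],
]

def plan_path(grid, start, goal_label):
    """BFS over the floor-plan grid to the nearest cell with goal_label."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if grid[r][c] == goal_label:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None  # goal label not reachable from start

# A concise command ("go to the kitchen") replaces a verbose,
# step-by-step instruction: the floor plan supplies the route.
print(plan_path(FLOOR_PLAN, start=(2, 0), goal_label="kitchen"))
# -> [(2, 0), (2, 1), (2, 2)]
```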

From the abstract

The existing Vision-Language Navigation (VLN) task requires agents to follow verbose instructions, ignoring potentially useful global spatial priors and limiting their ability to reason about spatial structures. Although human-readable spatial schematics (e.g., floor plans) are ubiquitous in real-world buildings, current agents lack the cognitive ability to comprehend and utilize them. To bridge this gap, we introduce FloorPlan-VLN, a new paradigm that leverages structured semantic floor plans as global spatial priors.
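In the simplest reading, "bridging" structured schematics and visual observations could mean fusing the global plan prior with local perception when scoring the next move. The sketch below assumes a fused score that adds a visual confidence to a bonus for reducing remaining floor-plan distance to the goal; the weighting, the fake scores, and all names here are invented for illustration and are not claimed to be the paper's method.

```python
LAMBDA = 0.5  # assumed trade-off between local vision and the global prior

def choose_action(candidates, visual_scores, dist_to_goal):
    """Pick the candidate cell with the best fused score."""
    def fused(cell):
        # Lower remaining plan distance -> larger (less negative) bonus.
        return visual_scores[cell] - LAMBDA * dist_to_goal[cell]
    return max(candidates, key=fused)

candidates = [(1, 0), (2, 1)]
visual_scores = {(1, 0): 0.7, (2, 1): 0.6}  # e.g., traversability confidences
dist_to_goal = {(1, 0): 3, (2, 1): 1}       # BFS hops on the floor plan
print(choose_action(candidates, visual_scores, dist_to_goal))
# -> (2, 1): the floor-plan prior outweighs the slightly weaker visual cue
```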