Introduces a paradigm for vision-language navigation that uses ubiquitously available semantic floor plans as global spatial priors.
arXiv · March 19, 2026 · 2603.17437
The Takeaway
Standard VLN agents rely solely on local visual cues and verbose instructions; FloorPlan-VLN lets agents navigate complex buildings using only concise commands and a 2D floor plan. It achieves a 60% relative improvement in success rate by bridging the gap between structured schematics and visual observations.
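The core idea lends itself to a small sketch: encode the 2D semantic floor plan as a global spatial prior, encode the current egocentric view as a local cue, and fuse the two to score navigation actions. The sketch below is a minimal PyTorch illustration of that fusion, not the paper's actual architecture; the module names, feature sizes, number of semantic classes, and action set are all assumptions, and the command/instruction encoder is omitted for brevity.

```python
# Minimal sketch (assumed design, not the paper's method): fuse a semantic
# floor-plan grid (global prior) with an egocentric RGB frame (local cue)
# to score navigation actions.

import torch
import torch.nn as nn

NUM_SEMANTIC_CLASSES = 16   # assumed floor-plan labels (wall, door, room, ...)
NUM_ACTIONS = 4             # assumed action set: forward, left, right, stop

class FloorPlanFusionAgent(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Encode the 2D semantic floor plan (one channel per semantic class).
        self.plan_encoder = nn.Sequential(
            nn.Conv2d(NUM_SEMANTIC_CLASSES, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Encode the egocentric RGB observation.
        self.obs_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Fuse the global prior with the local view and score actions.
        self.policy = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, NUM_ACTIONS),
        )

    def forward(self, floor_plan: torch.Tensor, observation: torch.Tensor) -> torch.Tensor:
        plan_feat = self.plan_encoder(floor_plan)   # global spatial prior
        obs_feat = self.obs_encoder(observation)    # local visual cue
        return self.policy(torch.cat([plan_feat, obs_feat], dim=-1))

# Usage with dummy tensors: one semantic floor-plan grid and one camera frame.
agent = FloorPlanFusionAgent()
plan = torch.zeros(1, NUM_SEMANTIC_CLASSES, 64, 64)  # one-hot semantic grid
frame = torch.zeros(1, 3, 224, 224)                  # RGB observation
action_logits = agent(plan, frame)                   # shape: (1, NUM_ACTIONS)
```

The point of the two-branch design is that the floor-plan pathway supplies global structure (room layout, corridors) that the egocentric view alone cannot provide, which is the gap the reported success-rate improvement is attributed to.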
From the abstract
The existing Vision-Language Navigation (VLN) task requires agents to follow verbose instructions, ignoring potentially useful global spatial priors and limiting their ability to reason about spatial structures. Although human-readable spatial schematics (e.g., floor plans) are ubiquitous in real-world buildings, current agents lack the cognitive ability to comprehend and utilize them. To bridge this gap, we introduce FloorPlan-VLN, a new paradigm that leverages structured semantic floor plans as global spatial priors.