Moves MLLMs from reactive planning to 'mental navigation' by having them build hierarchical cognitive maps from egocentric video.
March 24, 2026
Original Paper
Mind over Space: Can Multimodal Large Language Models Mentally Navigate?
arXiv · 2603.21577
The Takeaway
Standard MLLMs fail at spatial reasoning over long horizons. NavMind introduces a paradigm in which models internalize spatial representations and simulate paths before acting, with the aim of narrowing the gap between reactive AI and biological spatial intelligence.
From the abstract
Despite the widespread adoption of MLLMs in embodied agents, their capabilities remain largely confined to reactive planning from immediate observations, consistently failing in spatial reasoning across extensive spatiotemporal scales. Cognitive science reveals that Biological Intelligence (BI) thrives on "mental navigation": the strategic construction of spatial representations from experience and the subsequent mental simulation of paths prior to action. To bridge the gap between AI and BI, we