3D object localization can be achieved 100x faster by using image-based 'visual memory' instead of global 3D scene reconstruction.
March 24, 2026
Original Paper
Memory Over Maps: 3D Object Localization Without Reconstruction
arXiv · 2603.20530
The Takeaway
The method shows that explicit global maps (point clouds, voxel grids) are often unnecessary for robotics tasks. By reasoning directly over posed 2D keyframes with VLMs, it achieves strong navigation performance with a fraction of the storage and preprocessing time.
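To make the idea concrete, here is a minimal hypothetical sketch of the keyframe-memory approach: keyframes are stored as (image, camera pose, depth map) tuples, a VLM scores each keyframe against a text query (stubbed here as `score_fn`), and the target's 3D position is recovered by back-projecting a grounded pixel through the pinhole camera model. No global map is ever built. The data layout, `score_fn`, and `target_pixel` field are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def backproject(u, v, depth, K, T_cw):
    """Lift pixel (u, v) with metric depth into world coordinates
    via the pinhole model and a 4x4 camera-to-world pose T_cw."""
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    p_cam = np.array([x, y, depth, 1.0])  # homogeneous camera-frame point
    return (T_cw @ p_cam)[:3]

def localize(query, keyframes, score_fn, K):
    """Pick the keyframe a VLM scores highest for `query`, then lift
    the pixel the VLM grounds the query to into a 3D world point."""
    best = max(keyframes, key=lambda kf: score_fn(query, kf["image"]))
    u, v = best["target_pixel"]  # assumed: pixel grounding from the VLM
    return backproject(u, v, best["depth"][v, u], K, best["pose"])

# Toy usage with one keyframe: identity pose, flat 2 m depth, target
# at the principal point -> the object sits 2 m straight ahead.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
keyframes = [{
    "image": None,                       # image payload unused by the stub
    "pose": np.eye(4),                   # camera at the world origin
    "depth": np.full((480, 640), 2.0),   # constant 2 m depth map
    "target_pixel": (320, 240),
}]
point = localize("mug", keyframes, lambda q, img: 1.0, K)
print(point)  # -> [0. 0. 2.]
```

Note that the expensive step in a map-based pipeline (global reconstruction) is absent entirely; the only per-query cost is scoring a handful of stored keyframes.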
From the abstract
Target localization is a prerequisite for embodied tasks such as navigation and manipulation. Conventional approaches rely on constructing explicit 3D scene representations to enable target localization, such as point clouds, voxel grids, or scene graphs. While effective, these pipelines incur substantial mapping time, storage overhead, and scalability limitations. Recent advances in vision-language models suggest that rich semantic reasoning can be performed directly on 2D observations, raising the question of whether explicit 3D reconstruction is necessary at all.