AI & ML Efficiency Breakthrough

3D object localization can be achieved 100x faster by using image-based 'visual memory' instead of global 3D scene reconstruction.

March 24, 2026

Original Paper

Memory Over Maps: 3D Object Localization Without Reconstruction

Rui Zhou, Xander Yap, Jianwen Cao, Allison Lau, Boyang Sun, Marc Pollefeys

arXiv · 2603.20530

The Takeaway

The method shows that explicit global maps (point clouds, voxel grids) are often unnecessary for robotic localization tasks. By reasoning directly over posed 2D keyframes with vision-language models (VLMs), it achieves strong localization and navigation performance with a fraction of the storage and preprocessing time.
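The core idea of an image-based "visual memory" can be sketched as a store of posed keyframes with semantic embeddings, queried by similarity instead of by geometric lookup in a global map. This is a hypothetical toy sketch, not the authors' implementation: `KeyframeMemory`, the random stand-in embeddings, and the cosine-similarity retrieval are all illustrative assumptions in place of real VLM features.

```python
import numpy as np

class KeyframeMemory:
    """Toy 'visual memory': posed 2D keyframes plus semantic embeddings.

    Hypothetical sketch of the digest's description, not the paper's code.
    The embeddings below are random stand-ins for VLM image features.
    """

    def __init__(self):
        self.poses = []       # 4x4 camera-to-world matrices, one per keyframe
        self.embeddings = []  # unit-norm semantic feature vectors

    def add(self, pose, embedding):
        self.poses.append(pose)
        self.embeddings.append(embedding / np.linalg.norm(embedding))

    def query(self, query_embedding, k=1):
        """Return indices of the k keyframes most similar to the query."""
        q = query_embedding / np.linalg.norm(query_embedding)
        sims = np.stack(self.embeddings) @ q  # cosine similarities
        return np.argsort(-sims)[:k]

# Usage: three keyframes; a query matching the second retrieves index 1.
rng = np.random.default_rng(0)
mem = KeyframeMemory()
embs = [rng.normal(size=8) for _ in range(3)]
for e in embs:
    mem.add(np.eye(4), e)
best = int(mem.query(embs[1], k=1)[0])
print(best)  # -> 1
```

Retrieval then hands the selected keyframes (with their poses) to the VLM for semantic reasoning, so no point cloud or voxel grid is ever built.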

From the abstract

Target localization is a prerequisite for embodied tasks such as navigation and manipulation. Conventional approaches rely on constructing explicit 3D scene representations to enable target localization, such as point clouds, voxel grids, or scene graphs. While effective, these pipelines incur substantial mapping time, storage overhead, and scalability limitations. Recent advances in vision-language models suggest that rich semantic reasoning can be performed directly on 2D observations, raising the question of whether explicit 3D reconstruction is necessary for localization at all.
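As a minimal illustration of why a global map is not strictly needed, a single posed keyframe with per-pixel depth already suffices to lift a 2D detection into 3D world coordinates. This is a standard pinhole back-projection sketch, an assumption for illustration rather than anything taken from the paper; `pixel_to_world`, the intrinsics `K`, and the pose `T_cam_to_world` are all hypothetical names.

```python
import numpy as np

def pixel_to_world(u, v, depth, K, T_cam_to_world):
    """Lift one detected pixel (u, v) at the given depth to a 3D world point.

    K is a 3x3 pinhole intrinsic matrix; T_cam_to_world is a 4x4
    camera-to-world pose. Shows that per-keyframe geometry alone can
    localize an object, with no global reconstruction.
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # normalized camera ray
    p_cam = np.append(depth * ray, 1.0)             # homogeneous camera point
    return (T_cam_to_world @ p_cam)[:3]             # world coordinates

# Usage: principal-point pixel at 2 m depth, identity pose.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
p = pixel_to_world(320, 240, 2.0, K, np.eye(4))
print(p)  # -> [0. 0. 2.]
```

Because each keyframe carries its own pose, localization reduces to picking the right frame and detection, which is exactly where the VLM's 2D semantic reasoning comes in.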