AI & ML Efficiency Breakthrough

Distills a 2B Vision-Language Retriever into a 70M text-only encoder for visual document retrieval with 50x lower latency.

arXiv · March 16, 2026 · 2603.12824

Zhuchenyang Liu, Yao Zhang, Yu Xiao

Why it matters

NanoVDR exploits the asymmetry between visually complex documents and short text queries, moving the heavy computation to offline indexing. This allows high-quality visual document search to run on standard CPUs and edge devices with almost no loss in retrieval performance.

From the abstract

Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. Yet they require the same multi-billion-parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query–document asymmetry.