Distills a 2B Vision-Language Retriever into a 70M text-only encoder for visual document retrieval with 50x lower latency.
arXiv · March 16, 2026 · 2603.12824
Why it matters
It exploits the asymmetry between complex visual documents and simple text queries to move the heavy computation to offline indexing. This allows high-quality visual document search to run on standard CPUs and edge devices with almost no loss in retrieval performance.
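The offline/online split described above can be sketched as an asymmetric dual-encoder: a large VLM embeds documents once at indexing time, while a tiny text encoder serves queries. Everything below is a hypothetical stand-in (random document embeddings, a toy hashed bag-of-words query encoder), purely to illustrate the retrieval pattern, not NanoVDR's actual models or API:

```python
import numpy as np

# --- Offline (GPU, run once): a large VLM encoder embeds each visual page.
# Random unit vectors stand in for the 2B-parameter encoder's output.
rng = np.random.default_rng(0)
DIM = 256  # hypothetical shared embedding dimension
doc_embeddings = rng.normal(size=(1000, DIM))  # 1000 indexed pages
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

# --- Online (CPU/edge): a small text encoder maps the query into the
# same space. A hashed bag-of-words projection stands in for the
# distilled 70M text-only encoder.
def encode_query(text: str) -> np.ndarray:
    v = np.zeros(DIM)
    for tok in text.lower().split():
        v[hash(tok) % DIM] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def search(query: str, k: int = 5) -> np.ndarray:
    q = encode_query(query)
    scores = doc_embeddings @ q        # cosine similarity (all unit-norm)
    return np.argsort(-scores)[:k]     # indices of top-k documents

top = search("quarterly revenue table")
print(top)  # five document indices, best match first
```

Because only the lightweight query encoder runs at search time, the expensive visual encoding never touches the serving path, which is where the latency and hardware savings come from.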
From the abstract
Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality, but they require the same multi-billion-parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query–document asymmetry …