Hydra unifies ColBERT-style retrieval and autoregressive generation in a single vision-language model (VLM), using one LoRA adapter.
March 31, 2026
Original Paper
Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
arXiv · 2603.28554
The Takeaway
Hydra reduces GPU memory use by 41% by eliminating the need for separate retrieval and generation models, while its generation output remains 100% byte-identical to that of the standalone base model.
From the abstract
Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model's generation quality -- byte-identical
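To make the two ideas in the abstract concrete, here is a minimal pure-Python sketch, not the paper's implementation. It assumes a toy linear layer (the hypothetical `LoRALinear` class) whose low-rank update is kept separate from the frozen base weight, and a standard ColBERT-style MaxSim score (the hypothetical `maxsim` helper). All names, shapes, and values are illustrative assumptions.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class LoRALinear:
    """Toy layer (assumption): frozen base weight W plus an unmerged low-rank update B @ A."""

    def __init__(self, W, A, B, scale=1.0):
        self.W, self.A, self.B, self.scale = W, A, B, scale
        self.adapter_enabled = True

    def __call__(self, x):
        base = matvec(self.W, x)
        if not self.adapter_enabled:
            # Adapter off: only the base computation runs, so the result is
            # exactly what the base layer alone would produce.
            return base
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector, take the maximum
    dot product over all document vectors, then sum over query vectors."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Illustrative values only (not from the paper).
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
A = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]          # rank-2 down-projection
B = [[0.5, 0.0],
     [0.0, 0.5],
     [0.0, 0.0],
     [0.0, 0.0]]                    # rank-2 up-projection
x = [1.0, 2.0, 3.0, 4.0]

layer = LoRALinear(W, A, B)

layer.adapter_enabled = True        # "retrieval mode": adapted output
adapted_out = layer(x)

layer.adapter_enabled = False       # "generation mode": base output
base_out = layer(x)

# Scoring a two-vector query against a two-vector document.
score = maxsim([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]])
```

In this toy, the adapter is never merged into `W`, so turning it off skips the update entirely rather than subtracting it back out. That is one simple way to get exactly the base output; whether Hydra toggles its adapter this way is an assumption here, not something the excerpt states.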