Achieves state-of-the-art vision-language pretraining using 300x less data than leading methods.
March 27, 2026
Original Paper
GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining
arXiv · 2603.24804
The Takeaway
GoldiCLIP demonstrates that high-quality supervision (self-distillation and VQA objectives) allows training competitive VLMs on just 30M images. This democratizes high-performance multimodal pretraining for researchers without access to billion-scale datasets.
From the abstract
Until recently, the success of large-scale vision-language models (VLMs) has relied primarily on billion-sample datasets, posing a significant barrier to progress. Recent works have begun to close this gap by improving supervision quality, but each addresses only a subset of the weaknesses in contrastive pretraining. We present GoldiCLIP, a framework built on a Goldilocks principle of finding the right balance of supervision signals. Our multifaceted training framework synergistically combines […]
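To make the idea concrete, here is a minimal sketch of what a balanced multi-objective loss of this kind could look like in PyTorch. Everything below is an illustrative assumption, not the paper's implementation: the function names, the DINO-style self-distillation head, the generative VQA head, and the loss weights are all placeholders standing in for whatever GoldiCLIP actually uses.

```python
# Hypothetical sketch of a balanced multi-objective VLM loss.
# Names, heads, and weights are assumptions, not taken from the paper.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Standard CLIP-style InfoNCE over a batch of paired embeddings.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=img_emb.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def distillation_loss(student_logits, teacher_logits, tau=2.0):
    # Soft-label KL between a frozen/EMA teacher and the student
    # (assumption: DINO-style self-distillation; the paper may differ).
    teacher = F.softmax(teacher_logits / tau, dim=-1).detach()
    student = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * tau ** 2

def vqa_loss(answer_logits, answer_ids):
    # Token-level cross-entropy for a generative VQA head (assumption).
    # answer_logits: (batch, seq, vocab); answer_ids: (batch, seq).
    return F.cross_entropy(answer_logits.flatten(0, 1), answer_ids.flatten())

def goldilocks_loss(batch, w_con=1.0, w_dist=0.5, w_vqa=0.5):
    # "Goldilocks" balance: a weighted sum of the three supervision
    # signals. The weights here are placeholders, not tuned values.
    return (w_con * contrastive_loss(batch["img_emb"], batch["txt_emb"]) +
            w_dist * distillation_loss(batch["student_logits"],
                                       batch["teacher_logits"]) +
            w_vqa * vqa_loss(batch["answer_logits"], batch["answer_ids"]))
```

The intuition the sketch captures is that each term compensates for a weakness of the others: the contrastive term aligns the modalities, the distillation term supplies dense soft targets that raw web captions lack, and the VQA term forces fine-grained grounding, so a smaller dataset can carry more signal per sample.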