Achieves state-of-the-art compositionality in vision-language models without hard negative mining and without degrading zero-shot performance.
March 27, 2026
Original Paper
No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
arXiv · 2603.25722
The Takeaway
The paper challenges the prevailing belief that hard negatives are necessary for learning binding and compositionality. By combining concept-centric captioning with attention pooling, it repairs structural information loss in the encoders, offering a simpler and more robust path to training V&L models.
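To make the attention-pooling idea concrete, here is a minimal sketch of pooling a sequence of token embeddings with a learnable query, in place of mean or CLS pooling. This is an illustrative reconstruction, not the paper's implementation: the function names (`attention_pool`, `softmax`) and the single-query design are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, query):
    """Pool per-token encoder features into one vector via a learned query.

    tokens: (seq_len, dim) token embeddings from the encoder
    query:  (dim,) learnable pooling query (hypothetical parameter)
    """
    d = tokens.shape[-1]
    # Scaled dot-product scores between the query and every token.
    weights = softmax(tokens @ query / np.sqrt(d))  # (seq_len,)
    # Weighted sum keeps per-token structure that mean pooling averages away.
    return weights @ tokens                          # (dim,)

rng = np.random.default_rng(0)
toks = rng.normal(size=(16, 64))   # 16 tokens, 64-dim features
q = rng.normal(size=64)
pooled = attention_pool(toks, q)
print(pooled.shape)
```

Unlike uniform mean pooling, the learned query can weight the tokens that carry relational structure, which is one plausible way such pooling mitigates the structural information loss the paper describes.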
From the abstract
Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial…