Achieves state-of-the-art compositionality in vision-language models without hard negative mining and without degrading zero-shot performance.
March 27, 2026
Original Paper
No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
arXiv · 2603.25722
The Takeaway
The paper challenges the prevailing belief that hard negatives are necessary for learning binding and compositionality. By combining concept-centric captioning with attention pooling, it repairs structural information loss in the encoders, offering a simpler and more robust path to training V&L models.
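To make the attention-pooling idea concrete, here is a minimal sketch of pooling a sequence of token embeddings with a learnable query, in place of mean or CLS pooling. This is an illustrative reconstruction, not the paper's implementation: the function names (`attention_pool`, `softmax`) and the single-query design are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, query):
    """Pool per-token encoder features into one vector via a learned query.

    tokens: (seq_len, dim) token embeddings from the encoder
    query:  (dim,) learnable pooling query (hypothetical parameter)
    """
    d = tokens.shape[-1]
    # Scaled dot-product scores between the query and every token.
    weights = softmax(tokens @ query / np.sqrt(d))  # (seq_len,)
    # Weighted sum keeps per-token structure that mean pooling averages away.
    return weights @ tokens                          # (dim,)

rng = np.random.default_rng(0)
toks = rng.normal(size=(16, 64))   # 16 tokens, 64-dim features
q = rng.normal(size=64)
pooled = attention_pool(toks, q)
print(pooled.shape)
```

Unlike uniform mean pooling, the learned query can weight the tokens that carry relational structure, which is one plausible way such pooling mitigates the structural information loss the paper describes.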
From the abstract
Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial…