AI & ML Breaks Assumption

Debunks the widely held 'intra-modal misalignment hypothesis', which claimed that CLIP embeddings are inherently poor for image-only tasks.

arXiv · March 18, 2026 · 2603.16100

Jonas Herzog, Yue Wang

The Takeaway

Shows that performance gaps in retrieval and classification aren't due to poor embedding geometry but rather task ambiguity, suggesting researchers should stop trying to 'fix' CLIP's internal image distances and focus on better task-specific calibration.

From the abstract

Recent research suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics aff