Debunks the widely held 'intra-modal misalignment hypothesis', which claimed that CLIP embeddings are inherently poor for image-only tasks.
arXiv · March 18, 2026 · 2603.16100
The Takeaway
Shows that performance gaps in retrieval and classification aren't due to poor embedding geometry but rather to task ambiguity, suggesting researchers should stop trying to 'fix' CLIP's internal image distances and instead focus on better task-specific calibration.
From the abstract
Recent research suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics …
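To make the inter- vs intra-modal distinction concrete, here is a minimal sketch of the two quantities the hypothesis contrasts: similarity between matched image-text pairs (what the contrastive loss optimizes) versus similarity among images themselves (what the hypothesis claims is poorly calibrated). The embeddings are synthetic unit vectors standing in for CLIP outputs, not real model features; the noise level and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(x):
    # Normalize rows to unit length, as CLIP-style embeddings are.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

d, n = 64, 100
# Each image-text pair shares a latent direction plus independent noise,
# mimicking embeddings pulled together by an inter-modal contrastive loss.
shared = rng.normal(size=(n, d))
img = unit(shared + 0.5 * rng.normal(size=(n, d)))
txt = unit(shared + 0.5 * rng.normal(size=(n, d)))

# Inter-modal alignment: mean cosine similarity of matched image-text pairs.
inter = np.mean(np.sum(img * txt, axis=1))

# Intra-modal structure: cosine similarities among distinct images --
# the distances the misalignment hypothesis says the loss leaves uncalibrated.
sims = img @ img.T
intra = sims[np.triu_indices(n, k=1)].mean()

print(f"inter-modal (paired) cosine:      {inter:.3f}")
print(f"intra-modal (image-image) cosine: {intra:.3f}")
```

In this toy setup the training objective directly shapes the inter-modal similarities, while image-image similarities are only constrained indirectly; the debated question is whether that indirect constraint actually leaves them miscalibrated in real CLIP models.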