A training-free method to fix intra-modal misalignment in CLIP by decomposing projectors into an isotropic aligned subspace.
March 23, 2026
Original Paper
IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment
arXiv · 2603.19862
The Takeaway
CLIP is notoriously poor at intra-modal tasks (like image-to-image retrieval). This method offers a training-free, purely mathematical way to strip 'anisotropic directions' out of the projector weights, significantly improving intra-modal retrieval accuracy while also reducing latency.
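The paper's exact decomposition isn't detailed in this summary, so the following is only a minimal sketch of the general idea of removing anisotropic directions from a projector weight matrix. It uses a standard SVD-based trick (flattening the singular-value spectrum); the function name, matrix shapes, and the choice of full isotropization are all illustrative assumptions, not the authors' method.

```python
import numpy as np

def isotropize(W: np.ndarray) -> np.ndarray:
    """Replace the singular values of W with 1, keeping its singular
    directions (a candidate 'aligned subspace') but making the linear
    map isotropic, so no embedding direction dominates."""
    # Hypothetical illustration -- not the IsoCLIP decomposition itself.
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt  # orthonormal factor: equal gain in every direction

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 768))   # assumed CLIP-like projector shape
W_iso = isotropize(W)

# After isotropization, every singular value equals 1.
print(np.allclose(np.linalg.svd(W_iso, compute_uv=False), 1.0))
```

Because the operation is a one-off transform of frozen weights, it fits the "training-free" framing: the modified projector can simply be swapped in at inference time.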
From the abstract
Vision-language models like CLIP are extensively used for inter-modal tasks that involve both the visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from intra-modal misalignment. In this paper, we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing…