AI & ML Breaks Assumption

The 'modality gap' in Vision-Language Models is composed of two distinct geometric components, and the commonly used 'raw gap' is a misleading metric for cross-modal quality.

April 2, 2026

Original Paper

The Geometry of Compromise: Unlocking Generative Capabilities via Controllable Modality Alignment

Hongyuan Liu, Qinli Yang, Wen Li, Zhong Zhang, Jiaming Liu, Wei Han, Zhili Qin, Jinxia Guo, Junming Shao

arXiv · 2604.00279

The Takeaway

By identifying the 'Distribution Gap' as the true predictor of performance (R² = 0.986), this paper offers a more precise framework for aligning CLIP-like models. It explains why simple centroid-shifting fails and introduces a curriculum (TPC-CMA) that reshapes embedding spaces for better generative task performance.

From the abstract

Vision-Language Models (VLMs) such as CLIP learn a shared embedding space for images and text, yet their representations remain geometrically separated, a phenomenon known as the modality gap. This gap limits tasks requiring cross-modal interchangeability, such as captioning and joint clustering. Existing post-processing approaches can partially improve cross-modal compatibility; however, we show through geometric analysis that they primarily reduce the global centroid offset while leaving the u…
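The decomposition described above can be illustrated with a minimal sketch. Below, the centroid offset is the distance between modality means, and a per-dimension spread mismatch stands in as an illustrative proxy for the distribution gap (the paper's exact metric may differ). Centering each modality's embeddings zeroes the centroid offset, yet the distributional mismatch is untouched, which is why centroid-shifting alone fails. The embeddings here are synthetic, not real CLIP outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
# Hypothetical embedding clouds: different means AND different shapes.
img = rng.normal(0.0, 1.0, size=(500, d)) + 2.0  # image cloud, shifted centroid
txt = rng.normal(0.0, 0.3, size=(500, d))        # text cloud, tighter spread

def centroid_offset(a, b):
    # Global offset between modality centroids (what the 'raw gap' mostly measures).
    return np.linalg.norm(a.mean(0) - b.mean(0))

def distribution_gap(a, b):
    # Illustrative proxy only: mismatch in per-dimension spread.
    return np.linalg.norm(a.std(0) - b.std(0))

# Centroid-shifting: subtract each modality's own mean.
img_c, txt_c = img - img.mean(0), txt - txt.mean(0)

print(centroid_offset(img, txt))       # large before centering
print(centroid_offset(img_c, txt_c))   # ~0 after centering: raw gap 'closed'
print(distribution_gap(img_c, txt_c))  # unchanged: the clouds' shapes still differ
```

The point of the sketch: subtracting means is a rigid translation, so any statistic that is invariant to translation, including the spread mismatch above, survives it. Closing the distribution gap requires actually reshaping the embedding distributions, which is what a training curriculum like TPC-CMA targets.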