Provides a causal explanation for 'embedding collapse' in Transformers, linking it to the concept of semantic shift rather than just text length.
March 24, 2026
Original Paper
Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval
arXiv · 2603.21437
The Takeaway
The paper formalizes why pooled representations become less discriminative as text diversity increases ("semantic smoothing"). This gives practitioners a theoretical and actionable lens for diagnosing when anisotropy will harm retrieval performance in RAG systems.
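The smoothing effect is easy to see in a toy model. The sketch below is an illustration, not the paper's construction: assume every token embedding shares a common "anisotropy" direction plus token-specific noise. Mean pooling over more (and more diverse) tokens averages the noise away, so pooled document vectors drift toward the shared direction and their pairwise cosine similarity rises.

```python
import numpy as np

# Toy model (an assumption for illustration, not the paper's setup):
# token embedding = shared anisotropic direction u + i.i.d. Gaussian noise.
rng = np.random.default_rng(0)
d = 64
u = np.ones(d) / np.sqrt(d)  # shared direction, unit norm

def pooled_doc(n_tokens: int) -> np.ndarray:
    """Mean-pool n_tokens synthetic token embeddings into one vector."""
    tokens = u + rng.normal(scale=1.0, size=(n_tokens, d))
    return tokens.mean(axis=0)

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_pairwise_sim(n_tokens: int, n_docs: int = 40) -> float:
    """Average cosine similarity across all pairs of pooled documents."""
    docs = [pooled_doc(n_tokens) for _ in range(n_docs)]
    return float(np.mean([cos(docs[i], docs[j])
                          for i in range(n_docs)
                          for j in range(i + 1, n_docs)]))

short = mean_pairwise_sim(4)    # short / low-diversity documents
long_ = mean_pairwise_sim(256)  # long / high-diversity documents
print(f"mean similarity, short docs: {short:.2f}, long docs: {long_:.2f}")
```

With few tokens the noise dominates and unrelated documents stay nearly orthogonal; with many tokens the pooled vectors crowd around the shared direction, so similarity scores stop discriminating between documents, which is exactly the failure mode the paper analyzes for retrieval.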
From the abstract
Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe *what* these pathologies look like, yet provide limited insight into *when* and *why* they harm downstream retrieval. In this work, we argue that the missing causal factor is *semantic shift* […]