CLIR | Mar 22

Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval

arXiv:2603.21437 | 84.2 | h-index: 10
AI Analysis

This paper addresses a fundamental challenge in text retrieval for AI and ML practitioners and offers a diagnostic tool for embedding quality. The contribution is incremental, however, because it builds on existing analyses of embedding pathologies.

The paper tackles geometric pathologies such as anisotropy and embedding collapse in transformer-based text embeddings. It identifies semantic shift as the causal factor and shows that semantic shift predicts retrieval degradation, whereas text length alone does not.

Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe what these pathologies look like, yet provide limited insight into when and why they harm downstream retrieval. In this work, we argue that the missing causal factor is semantic shift: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of semantic smoothing in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.
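The smoothing effect described in the abstract can be illustrated with a small NumPy sketch. This is not the paper's formulation: the abstract does not give the exact definition of semantic shift, so the function names (`pooled_similarity`, `semantic_shift_proxy`), the use of mean pooling and cosine distance, and the equal-weight combination of the local and global terms are all assumptions made for illustration. The sentence embeddings are synthetic random vectors, not outputs of a real embedding model.

```python
import numpy as np


def unit(v):
    """L2-normalise vectors along the last axis."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def pooled_similarity(sent_embs):
    """Mean cosine similarity between the mean-pooled vector and each sentence.

    Lower values mean the pooled vector sits further from every sentence,
    i.e. it is more "smoothed".
    """
    e = unit(sent_embs)
    pooled = unit(e.mean(axis=0))
    return float((e @ pooled).mean())


def semantic_shift_proxy(sent_embs):
    """Hypothetical proxy, NOT the paper's measure.

    local  : mean cosine distance between consecutive sentences
    global : mean cosine distance of sentences to their centroid
    The two terms are averaged with equal weight (an assumption).
    """
    e = unit(sent_embs)
    local = float(np.mean(1.0 - np.sum(e[:-1] * e[1:], axis=1)))
    centroid = unit(e.mean(axis=0))
    dispersion = float(np.mean(1.0 - e @ centroid))
    return 0.5 * (local + dispersion)


def synthetic_text(rng, n_sentences, dim, spread):
    """Synthetic sentence embeddings: one shared topic direction plus noise."""
    topic = unit(rng.standard_normal(dim))
    noise = rng.standard_normal((n_sentences, dim)) / np.sqrt(dim)
    return topic + spread * noise


rng = np.random.default_rng(0)
focused = synthetic_text(rng, n_sentences=8, dim=64, spread=0.3)
diverse = synthetic_text(rng, n_sentences=8, dim=64, spread=3.0)

for name, text in [("focused", focused), ("diverse", diverse)]:
    print(name,
          "pooled similarity:", round(pooled_similarity(text), 3),
          "shift proxy:", round(semantic_shift_proxy(text), 3))
```

Both synthetic texts have the same number of sentences, so any difference in pooled similarity here comes from semantic diversity rather than length. As a sanity check on the geometry, four mutually orthogonal sentence embeddings give a pooled vector whose cosine similarity to each sentence is exactly 0.5, while identical sentences give 1.0.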

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
