CL | AI | Jan 23

Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias

arXiv:2601.16934v1 | 2 citations | h-index: 5 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses fairness issues in embedding-based search over long documents, particularly those affecting lower-resource languages, but it is incremental because it builds on existing models.

The paper tackles the problem of positional and language biases in long-document embeddings, where early segments and higher-resource languages are over-represented, and introduces an inference-time attention calibration method that increases discoverability of later segments by 15-20% in experiments.

To be discoverable in an embedding-based search process, each part of a document should be reflected in its embedding representation. To quantify any potential reflection biases, we introduce a permutation-based evaluation framework. With this, we observe that state-of-the-art embedding models exhibit systematic positional and language biases when documents are longer and consist of multiple segments. Specifically, early segments and segments in higher-resource languages like English are over-represented, while later segments and segments in lower-resource languages are marginalized. In our further analysis, we find that the positional bias stems from front-loaded attention distributions in pooling-token embeddings, where early tokens receive more attention. To mitigate this issue, we introduce an inference-time attention calibration method that redistributes attention more evenly across document positions, increasing discoverability of later segments. Our evaluation framework and attention calibration are available at https://github.com/impresso/fair-sentence-transformers
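The idea behind a permutation-based evaluation can be illustrated with a minimal sketch. This is not the code from the linked repository: `toy_embed`, `cosine`, and `reflection_by_position` are hypothetical names, and the toy embedder is a deliberately front-loaded bag of words standing in for a real embedding model. The sketch embeds the document under every segment order, compares the document embedding with each segment's standalone embedding, and averages the similarity per position, so that differences between positions cannot be attributed to segment content.

```python
import itertools
import math


def toy_embed(text, decay=0.8):
    """Hypothetical stand-in for an embedding model (an assumption, not a real API).

    Builds a sparse bag-of-words vector in which the token at index `pos`
    contributes weight decay**pos, so early tokens dominate when decay < 1.
    """
    vec = {}
    for pos, tok in enumerate(text.lower().split()):
        vec[tok] = vec.get(tok, 0.0) + decay ** pos
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def reflection_by_position(segments, embed):
    """Mean similarity between document and segment embeddings, per position.

    Every permutation of the segments is concatenated into one document, so
    each segment appears at each position equally often.
    """
    seg_vecs = [embed(s) for s in segments]
    totals = [0.0] * len(segments)
    orders = list(itertools.permutations(range(len(segments))))
    for order in orders:
        doc_vec = embed(" ".join(segments[i] for i in order))
        for pos, seg_idx in enumerate(order):
            totals[pos] += cosine(doc_vec, seg_vecs[seg_idx])
    return [t / len(orders) for t in totals]


segments = ["alpha beta gamma", "delta epsilon zeta", "eta theta iota"]

# Front-loaded embedder: similarity drops as the segment position increases.
scores = reflection_by_position(segments, toy_embed)

# Position-neutral embedder (decay=1.0): all positions are reflected equally.
flat_scores = reflection_by_position(segments, lambda t: toy_embed(t, decay=1.0))
```

In this toy setting, `scores` decreases with position while `flat_scores` is constant, which is the kind of contrast such a framework is meant to expose. With a real model, `embed` would wrap the model's encoding call, and exhaustive permutation would likely need to be replaced by sampling for documents with many segments.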

