BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation
This work addresses the need for more efficient and scalable training of dense retrieval models, offering an incremental improvement by exploiting graded relevance data that is now readily accessible.
The paper tackles dense retrieval models' reliance on binary relevance labels by proposing BiXSE, a pointwise training method that treats LLM-generated graded relevance scores as probabilistic targets. Across sentence embedding and retrieval benchmarks, BiXSE outperforms InfoNCE and matches or exceeds pairwise ranking baselines.
Neural sentence embedding models for dense retrieval typically rely on binary relevance labels, treating query-document pairs as either relevant or irrelevant. However, real-world relevance often exists on a continuum, and recent advances in large language models (LLMs) have made it feasible to scale the generation of fine-grained graded relevance labels. In this work, we propose BiXSE, a simple and effective pointwise training method that optimizes binary cross-entropy (BCE) over LLM-generated graded relevance scores. BiXSE interprets these scores as probabilistic targets, enabling granular supervision from a single labeled query-document pair per query. Unlike pairwise or listwise losses that require multiple annotated comparisons per query, BiXSE achieves strong performance with reduced annotation and compute costs by leveraging in-batch negatives. Extensive experiments across sentence embedding (MMTEB) and retrieval benchmarks (BEIR, TREC-DL) show that BiXSE consistently outperforms softmax-based contrastive learning (InfoNCE), and matches or exceeds strong pairwise ranking baselines when trained on LLM-supervised data. BiXSE offers a robust, scalable alternative for training dense retrieval models as graded relevance supervision becomes increasingly accessible.
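The objective described above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes graded scores have been normalized to [0, 1], that similarity is a scaled dot product, and that every in-batch document other than the labeled one receives a target of 0. The names `graded_bce_loss` and `scale` are hypothetical and chosen only for illustration.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for one logit and a soft target in [0, 1]."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def graded_bce_loss(queries, docs, grades, scale=1.0):
    """Mean BCE over the full query-document similarity matrix.

    queries[i] and docs[i] form the single labeled pair for query i, with
    graded relevance grades[i] in [0, 1] used as a probabilistic target.
    Every other document in the batch is treated as an in-batch negative
    with target 0 (an assumption of this sketch).
    """
    n = len(queries)
    total = 0.0
    for i, q in enumerate(queries):
        for j, d in enumerate(docs):
            target = grades[i] if i == j else 0.0
            total += bce_with_logits(scale * dot(q, d), target)
    return total / (n * n)


# Toy batch: two queries, each with one labeled document and a graded score.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs = [[0.9, 0.1], [0.2, 0.8]]
grades = [0.9, 0.6]  # hypothetical LLM-assigned relevance, already in [0, 1]
loss = graded_bce_loss(queries, docs, grades, scale=5.0)
```

Because each entry of the similarity matrix is scored independently, the loss for a labeled pair is minimized when the sigmoid of its logit equals the graded score, rather than being pushed toward 1 as with a binary label. This contrasts with InfoNCE, which normalizes scores with a softmax across the batch.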