IR · AI · Oct 25, 2024

pEBR: A Probabilistic Approach to Embedding Based Retrieval

arXiv:2410.19349v3 · 2 citations · h-index: 7 · EMNLP
Originality: Incremental advance
AI Analysis

The paper addresses retrieval inefficiencies in industrial systems, particularly for head and tail queries. It appears incremental, since it builds on existing embedding-based methods with a probabilistic twist.

The paper tackles the problem of fixed-size retrieval in embedding-based systems, which leads to insufficient recall for head queries and low precision for tail queries, by proposing a probabilistic framework (pEBR) that models item distributions per query and uses dynamic thresholds; experimental results show significant improvements in both retrieval precision and recall.

Embedding-based retrieval aims to learn a shared semantic representation space for both queries and items, enabling efficient and effective item retrieval through approximate nearest neighbor (ANN) algorithms. In current industrial practice, retrieval systems typically retrieve a fixed number of items for each query. However, this fixed-size retrieval often results in insufficient recall for head queries and low precision for tail queries. This limitation largely stems from the dominance of frequentist approaches in loss function design, which fail to address this challenge in industry. In this paper, we propose a novel probabilistic Embedding-Based Retrieval (pEBR) framework. Our method models the item distribution conditioned on each query, enabling the use of a dynamic cosine similarity threshold derived from the cumulative distribution function (CDF) of the probabilistic model. Experimental results demonstrate that pEBR significantly improves both retrieval precision and recall. Furthermore, ablation studies reveal that the probabilistic formulation effectively captures the inherent differences between head-to-tail queries.
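
The idea of a per-query cutoff derived from a CDF, as opposed to a fixed top-k, can be illustrated with a minimal Python sketch. This is not the paper's model: it substitutes an empirical CDF over a hypothetical sample of relevant-item similarities for the learned probabilistic model, and the names `dynamic_threshold`, `retrieve`, and `coverage` are assumptions made for illustration only.

```python
import numpy as np


def dynamic_threshold(relevant_sims, coverage=0.9):
    """Cutoff t such that about `coverage` of the sampled similarities are >= t.

    Assumption: `relevant_sims` is a sample of cosine similarities between one
    query and items relevant to it. The empirical quantile at 1 - coverage
    stands in for inverting the CDF of a fitted per-query distribution.
    """
    sims = np.asarray(relevant_sims, dtype=float)
    return float(np.quantile(sims, 1.0 - coverage))


def retrieve(query_vec, item_vecs, threshold):
    """Return indices of all items whose cosine similarity is >= threshold,
    sorted by descending similarity. The result size varies per query."""
    q = np.asarray(query_vec, dtype=float)
    items = np.asarray(item_vecs, dtype=float)
    q = q / np.linalg.norm(q)
    items = items / np.linalg.norm(items, axis=1, keepdims=True)
    sims = items @ q
    keep = np.flatnonzero(sims >= threshold)
    return keep[np.argsort(-sims[keep])]


if __name__ == "__main__":
    # Toy 2-D catalogue; the similarity samples below are made up.
    items = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
    query = np.array([1.0, 0.0])

    broad = [0.2, 0.4, 0.6, 0.8, 1.0]     # similarities spread widely
    narrow = [0.90, 0.92, 0.95, 0.97, 0.99]  # similarities concentrated near 1

    for name, sample in [("broad", broad), ("narrow", narrow)]:
        t = dynamic_threshold(sample, coverage=0.5)
        print(name, round(t, 3), retrieve(query, items, t).tolist())
```

With the same `coverage`, a query whose relevant items are spread over a wide similarity range gets a lower cutoff and therefore more results than a query whose relevant items are tightly concentrated, which is the behavior a fixed-size retrieval cannot provide.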

