$\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval

Zihao Wang, Hang Yin, Lihui Liu, Hanghang Tong, Yangqiu Song, Ginny Wong, Simon See

arXiv:2601.20844v21.4h-index: 13

Originality Incremental advance

AI Analysis

This addresses a foundational geometric constraint in embedding-based retrieval for machine learning and information retrieval, providing theoretical insights that could guide algorithm design.

The paper tackles the problem of determining the minimal dimension needed to embed subsets for top-k retrieval, deriving tight theoretical bounds and supporting them empirically for various distance metrics, showing that logarithmic dimension suffices.

This paper studies the minimal dimension required to embed subset memberships ($m$ elements and ${m\choose k}$ subsets of at most $k$ elements) into vector spaces, denoted as Minimal Embeddable Dimension (MED). The tight bounds of MED are derived theoretically and supported empirically for various notions of "distances" or "similarities," including the $\ell_2$ metric, inner product, and cosine similarity. In addition, we conduct numerical simulation in a more achievable setting, where the ${m\choose k}$ subset embeddings are chosen as the centroid of the embeddings of the contained elements. Our simulation easily realizes a logarithmic dependency between the MED and the number of elements to embed. These findings imply that embedding-based retrieval limitations stem primarily from learnability challenges, not geometric constraints, guiding future algorithm design.

View on arXiv PDF

Similar