Benchmarking Filtered Approximate Nearest Neighbor Search Algorithms on Transformer-based Embedding Vectors

Patrick Iff, Paul Bruegger, Marcin Chrapek, David Kochergin, Maciej Besta, Torsten Hoefler

arXiv:2507.21989


AI Analysis

This work addresses a gap in benchmarking FANNS algorithms for real-world applications like retrieval-augmented generation, though it is incremental as it focuses on dataset creation and evaluation rather than new algorithmic contributions.

The paper addresses the lack of public datasets for filtered approximate nearest neighbor search (FANNS) on transformer-based embeddings. It introduces the arxiv-for-fanns dataset, which covers the abstracts of over 2.7 million arXiv papers with 11 attributes, and benchmarks eleven FANNS methods, distilling the results into eight key observations for method selection.

Advances in embedding models for text, image, audio, and video drive progress across multiple domains, including retrieval-augmented generation, recommendation systems, and others. Many of these applications require an efficient method to retrieve items that are close to a given query in the embedding space while satisfying a filter condition based on the item's attributes, a problem known as filtered approximate nearest neighbor search (FANNS). By performing an in-depth literature analysis on FANNS, we identify a key gap in the research landscape: publicly available datasets with embedding vectors from state-of-the-art transformer-based text embedding models that contain abundant real-world attributes covering a broad spectrum of attribute types and value distributions. To fill this gap, we introduce the arxiv-for-fanns dataset of transformer-based embedding vectors for the abstracts of over 2.7 million arXiv papers, enriched with 11 real-world attributes such as authors and categories. We benchmark eleven different FANNS methods on our new dataset to evaluate their performance across different filter types, numbers of retrieved neighbors, dataset scales, and query selectivities. We distill our findings into eight key observations that guide users in selecting the most suitable FANNS method for their specific use cases.
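
To make the FANNS problem statement concrete, the following is a minimal sketch of an exact, brute-force filtered nearest neighbor search that applies the filter first and then ranks the remaining items by distance. It is not one of the eleven benchmarked methods and is not taken from the paper; it only illustrates what an approximate method is trying to reproduce. The attribute names (`category`, `year`), the toy vectors, and the helper names `filtered_knn` and `selectivity` are assumptions made for illustration. Selectivity is taken here as the fraction of items that pass the filter; some works use the complementary convention.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical item layout: (embedding vector, attribute dictionary).
Item = Tuple[Sequence[float], Dict[str, object]]
Predicate = Callable[[Dict[str, object]], bool]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtered_knn(items: List[Item], query: Sequence[float],
                 predicate: Predicate, k: int) -> List[int]:
    """Exact filtered k-NN: filter on attributes, then rank by distance."""
    # Keep only the indices of items whose attributes satisfy the filter.
    candidates = [i for i, (_, attrs) in enumerate(items) if predicate(attrs)]
    # Sort the surviving items by their distance to the query vector.
    candidates.sort(key=lambda i: euclidean(items[i][0], query))
    return candidates[:k]


def selectivity(items: List[Item], predicate: Predicate) -> float:
    """Fraction of items that pass the filter (assumed convention)."""
    return sum(1 for _, attrs in items if predicate(attrs)) / len(items)


# Toy data standing in for embeddings with attributes (illustrative only).
items: List[Item] = [
    ([0.0, 0.0], {"category": "cs.DB", "year": 2024}),
    ([1.0, 0.0], {"category": "cs.LG", "year": 2025}),
    ([0.0, 2.0], {"category": "cs.DB", "year": 2025}),
    ([3.0, 3.0], {"category": "cs.DB", "year": 2023}),
]


def is_db(attrs: Dict[str, object]) -> bool:
    """Example filter: keep only items in the cs.DB category."""
    return attrs["category"] == "cs.DB"


result = filtered_knn(items, [0.9, 0.1], is_db, k=2)
```

In this toy example the item closest to the query is the one labeled `cs.LG`, but it fails the filter, so the search returns the two closest `cs.DB` items instead. The brute-force scan touches every item, which is why approximate, index-based methods matter at the scale of millions of vectors.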
