DB AI AR LGMar 23, 2024

Efficient Data Access Paths for Mixed Vector-Relational Search

arXiv:2403.15807v16 citationsh-index: 6DaMoN

Originality Incremental advance

AI Analysis

This work addresses the challenge of optimizing data access for systems combining vector embeddings with relational metadata, which is incremental as it revisits and refines existing methods for mixed search scenarios.

The paper tackles the problem of efficiently executing mixed vector-relational search queries by analyzing alternative access paths, such as scan-based and index-based approaches, and finds that scanning is better at low selectivity while indexing is preferable at high selectivity, with the cross-point determined by data dimensionality and query concurrency.

The rapid growth of machine learning capabilities and the adoption of data processing methods using vector embeddings sparked a great interest in creating systems for vector data management. While the predominant approach of vector data management is to use specialized index structures for fast search over the entirety of the vector embeddings, once combined with other (meta)data, the search queries can also become selective on relational attributes - typical for analytical queries. As using vector indexes differs from traditional relational data access, we revisit and analyze alternative access paths for efficient mixed vector-relational search. We first evaluate the accurate but exhaustive scan-based search and propose hardware optimizations and alternative tensor-based formulation and batching to offset the cost. We outline the complex access-path design space, primarily driven by relational selectivity, and the decisions to consider when selecting an exhaustive scan-based search against an approximate index-based approach. Since the vector index primarily avoids expensive computation across the entire dataset, contrary to the common relational knowledge, it is better to scan at lower selectivity and probe at higher, with a cross-point between the two approaches dictated by data dimensionality and the number of concurrent search queries.

View on arXiv PDF

Similar