DB IR Mar 31

GRAB-ANNS: High-Throughput Indexing and Hybrid Search via GPU-Native Bucketing

Xinkui Zhao, Hengxuan Lou, Yifan Zhang, Junjie Dai, Shuiguang Deng, Jianwei Yin

arXiv:2604.16402 79.8 h-index: 9

AI Analysis

This work addresses the performance bottleneck of hybrid search in AI systems by leveraging GPU parallelism, offering substantial throughput gains for large-scale applications.

GRAB-ANNS introduces a GPU-native graph index for hybrid search that achieves up to 240.1x higher query throughput and 12.6x faster index construction than CPU-based systems, and up to 10x higher throughput than optimized GPU reimplementations.

Hybrid search, which jointly optimizes vector similarity and structured predicate filtering, has become a fundamental building block for modern AI-driven systems. While recent predicate-aware ANN indices improve filtering efficiency on CPUs, their performance is increasingly constrained by limited memory bandwidth and parallelism. Although GPUs offer massive parallelism and superior memory bandwidth, directly porting CPU-centric hybrid search algorithms to GPUs leads to severe performance degradation due to architectural mismatches, including irregular memory access, branch divergence, and excessive CPU-GPU synchronization. In this paper, we present GRAB-ANNS, a high-throughput, GPU-native graph index for dynamic hybrid search. Our key insight is to rethink hybrid indexing from a hardware-first perspective. We introduce a bucket-based memory layout that transforms range predicates into lightweight bucket selection, enabling coalesced memory accesses and efficient SIMT execution. To preserve global navigability under arbitrary filters, we design a hybrid graph topology that combines dense intra-bucket local edges with sparse inter-bucket remote edges. We further develop an append-only update pipeline that supports efficient batched insertions and parallel graph maintenance on GPUs. Extensive experiments on large-scale datasets show that GRAB-ANNS achieves up to 240.1 times higher query throughput and 12.6 times faster index construction than state-of-the-art CPU-based systems, and up to 10 times higher throughput compared to optimized GPU-native reimplementations, while maintaining high recall.
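The bucket-selection idea in the abstract can be illustrated with a minimal CPU-side sketch. This is not the authors' implementation: it assumes a single numeric attribute, fixed-size buckets formed by sorting on that attribute, and hypothetical helper names (`build_buckets`, `select_buckets`, `candidates`). It only shows how a range predicate can reduce to picking a contiguous run of buckets, with per-element checks needed only because the boundary buckets may straddle the range.

```python
from bisect import bisect_left, bisect_right

def build_buckets(attrs, bucket_size):
    """Sort point ids by attribute value and cut them into fixed-size buckets.

    Returns the buckets plus each bucket's minimum and maximum attribute value.
    (Hypothetical layout; the paper's actual bucketing scheme may differ.)
    """
    order = sorted(range(len(attrs)), key=lambda i: attrs[i])
    buckets = [order[s:s + bucket_size] for s in range(0, len(order), bucket_size)]
    lows = [attrs[b[0]] for b in buckets]
    highs = [attrs[b[-1]] for b in buckets]
    return buckets, lows, highs

def select_buckets(lows, highs, lo, hi):
    """Map the range predicate lo <= attr <= hi to a half-open bucket index range."""
    first = bisect_left(highs, lo)   # first bucket whose max is >= lo
    last = bisect_right(lows, hi)    # one past the last bucket whose min is <= hi
    return first, max(first, last)

def candidates(attrs, buckets, lows, highs, lo, hi):
    """Collect ids from the selected buckets that satisfy the predicate."""
    first, last = select_buckets(lows, highs, lo, hi)
    return [i for b in buckets[first:last] for i in b if lo <= attrs[i] <= hi]

attrs = [50, 10, 40, 20, 30, 60, 80, 70]
buckets, lows, highs = build_buckets(attrs, bucket_size=2)
print(select_buckets(lows, highs, 25, 55))
print(candidates(attrs, buckets, lows, highs, 25, 55))
```

Because the selected buckets are contiguous in this layout, a GPU version could read them with coalesced accesses; the vector-similarity search and the local and remote graph edges described in the abstract are outside the scope of this sketch.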
