Cuckoo-GPU: Accelerating Cuckoo Filters on Modern GPUs
Cuckoo-GPU lets high-throughput systems in databases, networking, and bioinformatics use dynamic AMQ structures on modern GPUs without sacrificing performance, a substantial gain rather than an incremental improvement.
The paper tackles the performance gap between append-only and dynamic Approximate Membership Query (AMQ) structures on GPUs by introducing Cuckoo-GPU, a high-performance Cuckoo filter library. It achieves insertion, query, and deletion throughputs up to 378x, 6x, and 258x higher, respectively, than the GPU Quotient Filter (GQF), an existing GPU-based dynamic alternative, while rivaling the query throughput of the append-only GPU-based Blocked Bloom filter.
Approximate Membership Query (AMQ) structures are essential for high-throughput systems in databases, networking, and bioinformatics. While Bloom filters offer speed, they lack support for deletions. Existing GPU-based dynamic alternatives, such as the Two-Choice Filter (TCF) and GPU Quotient Filter (GQF), enable deletions but incur severe performance penalties. We present Cuckoo-GPU, an open-source, high-performance Cuckoo filter library for GPUs. Instead of prioritizing cache locality, Cuckoo-GPU embraces the inherently random access pattern of Cuckoo hashing to fully saturate global memory bandwidth. Our design features a lock-free architecture built on atomic compare-and-swap operations, paired with a novel breadth-first search-based eviction heuristic that minimizes thread divergence and bounds sequential memory accesses during high-load insertions. Evaluated on NVIDIA GH200 (HBM3) and RTX PRO 6000 Blackwell (GDDR7) systems, Cuckoo-GPU closes the performance gap between append-only and dynamic AMQ structures. It achieves insertion, query, and deletion throughputs up to 378x (4.1x), 6x (34.7x), and 258x (107x) higher than GQF (TCF) on the same hardware, respectively, and delivers up to a 350x speedup over the fastest available multi-threaded CPU-based Cuckoo filter implementation. Moreover, its query throughput rivals that of the append-only GPU-based Blocked Bloom filter, demonstrating that dynamic AMQ structures can be deployed on modern accelerators without sacrificing performance.
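To make the insert, query, and delete operations concrete, here is a minimal sequential sketch of a generic partial-key Cuckoo filter in Python. It is not the Cuckoo-GPU implementation: the class name `CuckooFilter`, its parameters, and the BLAKE2b-based hashing are assumptions chosen for illustration. Plain Python lists stand in for the slots that a lock-free GPU design would update with atomic compare-and-swap, and the classic random-walk eviction replaces the breadth-first search heuristic described in the abstract.

```python
import hashlib
import random


class CuckooFilter:
    """Sequential sketch of a partial-key Cuckoo filter (illustrative only)."""

    def __init__(self, num_buckets=1024, bucket_size=4, fp_bits=16,
                 max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR-derived index in range.
        assert num_buckets > 0 and num_buckets & (num_buckets - 1) == 0
        assert 1 <= fp_bits <= 32
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.fp_bits = fp_bits
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(data: bytes) -> int:
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "little")

    def _fingerprint(self, item: str) -> int:
        fp = self._hash(b"fp:" + item.encode()) & ((1 << self.fp_bits) - 1)
        return fp or 1  # avoid 0 so it can mean "empty" in a packed layout

    def _index(self, item: str) -> int:
        return self._hash(b"ix:" + item.encode()) & (self.num_buckets - 1)

    def _alt_index(self, index: int, fp: int) -> int:
        # XOR with a hash of the fingerprint: applying it twice returns
        # the original index, so either bucket can locate the other.
        h = self._hash(fp.to_bytes(4, "little"))
        return (index ^ h) & (self.num_buckets - 1)

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both candidate buckets are full: evict a random resident
        # fingerprint and move it to its alternate bucket (random walk).
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Too full: the last evicted fingerprint is dropped in this sketch.
        return False

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item: str) -> bool:
        # Only delete items that were actually inserted; removing a
        # never-inserted item could erase another item's fingerprint.
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

A short usage example: `f = CuckooFilter(); f.insert("read_0042"); f.contains("read_0042"); f.delete("read_0042")`. Because only fingerprints are stored, `contains` can report false positives but never false negatives for successfully inserted items. Deletion works because each item owns exactly one stored fingerprint copy, which is what a Bloom filter lacks.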