DS DB | Mar 25

AutoCSF: Provably Space-Efficient Indexing of Skewed Key-Value Workloads via Filter-Augmented Compressed Static Functions

David Torres Ramos, Vihan Lakshman, Chen Luo, Todd Treangen, Benjamin Coleman

arXiv:2603.24882 | 97.1 | h-index: 22 | Has Code

AI Analysis

AutoCSF addresses a practical challenge in data-intensive domains such as computational genomics, where skewed value distributions dominate, and offers a principled solution with theoretical guarantees on space usage.

The paper tackles the problem of building space-efficient in-memory indexes for massive key-value datasets with skewed value distributions. It introduces AutoCSF, which combines compressed static functions with pre-filtering to provably achieve near-optimal space usage while keeping query latency low.

We study the problem of building space-efficient, in-memory indexes for massive key-value datasets with highly skewed value distributions. This challenge arises in many data-intensive domains and is particularly acute in computational genomics, where $k$-mer count tables can contain billions of entries dominated by a single frequent value. While recent work has proposed to address this problem by augmenting compressed static functions (CSFs) with pre-filters, existing approaches rely on complex heuristics and lack formal guarantees. In this paper, we introduce a principled algorithm, called AutoCSF, for combining CSFs with pre-filtering to provably handle skewed distributions with near-optimal space usage. We improve upon prior CSF pre-filtering constructions by (1) deriving a mathematically rigorous decision criterion for when filter augmentation is beneficial; (2) presenting a general algorithmic framework for integrating CSFs with modern set membership data structures beyond the classic Bloom filter; and (3) establishing theoretical guarantees on the overall space usage of the resulting indexes. Our open-source implementation of AutoCSF demonstrates space savings over baseline methods while maintaining low query latency.
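The query path of a filter-augmented static function can be illustrated with a small Python sketch. This is not the AutoCSF implementation: `ToyBloom`, `build_index`, and `lookup` are hypothetical names, a plain `dict` stands in for the compressed static function, and the filter is always built, so the paper's decision criterion for when filtering pays off is not reproduced here. The sketch assumes the common design in which the filter holds the keys whose value is not the dominant one, so a filter miss can be answered with the dominant value directly.

```python
import hashlib
import math
from collections import Counter


class ToyBloom:
    """Minimal Bloom filter; a stand-in for any set-membership structure."""

    def __init__(self, n_items, fp_rate):
        # Standard sizing: m = -n ln(eps) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = max(8, math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / max(1, n_items) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Double hashing derived from one SHA-256 digest.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(key))


def build_index(table, fp_rate=0.01):
    """Build (dominant value, filter, backing map) from a static key-value table."""
    dominant, _ = Counter(table.values()).most_common(1)[0]
    bloom = ToyBloom(sum(v != dominant for v in table.values()), fp_rate)
    for key, value in table.items():
        if value != dominant:
            bloom.add(key)
    # The backing function must cover every key that passes the filter:
    # all non-dominant keys plus any dominant-valued false positives.
    backing = {k: v for k, v in table.items() if k in bloom}
    return dominant, bloom, backing


def lookup(index, key):
    """Answer a query for a key that was present when the index was built."""
    dominant, bloom, backing = index
    if key not in bloom:
        return dominant  # filter miss: the key must carry the dominant value
    return backing[key]  # filter hit: defer to the (stand-in) static function


# Skewed toy table: 95% of keys map to 1, the rest to 2, 3, or 4.
table = {f"kmer{i}": (1 if i % 20 else 2 + i % 3) for i in range(1000)}
index = build_index(table)
```

Because a Bloom filter has no false negatives, every non-dominant key reaches the backing map, and dominant-valued false positives are stored there explicitly, so `lookup` is exact for all keys in the table. The space benefit comes from the backing structure covering only a small fraction of the keys when the distribution is heavily skewed.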
