DSDBMay 24

CAFS: A Cache-Aware Frequency Sort for Low-Cardinality Integer Data on x86-64

arXiv:2605.250401.3
Predicted impact top 97% in DS · last 90 daysOriginality Incremental advance
AI Analysis

It provides a faster sorting method for OLAP engines and column-stores where low-cardinality integer columns are common, but the improvement is incremental and limited to a specific regime.

CAFS is a cache-aware integer sort for low-cardinality data on x86-64 with AVX2, achieving 1.7–3.1x throughput over pdqsort and 1.2–2.3x over vqsort when cardinality is much smaller than array length, with a crossover at K ≈ 6.7×10^5 against vqsort.

Integer sorts in OLAP engines often run on columns whose cardinality $K$ is much smaller than the array length $N$. After a group-by stage the intermediate key column has $K$ bounded by the number of distinct group keys, and even a column-store scan typically operates on dictionary-encoded categorical fields where $K$ never exceeds a few thousand. A comparison sort on such a column still pays $Θ(N \log N)$ comparisons, and a radix sort still pays $Θ(N \cdot B/b)$ byte passes, irrespective of $K$. This paper describes CAFS, an integer sort that does exploit it on x86-64 with AVX2. The algorithm combines a SIMD bucket sized to one cache line, a Chao1 cardinality estimator over 1024 strided samples (kept in a heap-allocated 40 KB open-addressing table), and an adaptive dispatcher backed by a spill safety guard. The hot loop is branchless and uses AVX2 cmpeq together with movemask and tzcnt to locate the matching lane. We benchmarked CAFS on a full-factorial grid of 58 array sizes $N$ from $10^3$ to $3 \cdot 10^7$ with dense $K$ schedules per $N$, producing 592770 timed runs against pdqsort, IPS4o, vqsort, ska_sort, and std::sort. In the $K \ll N$ band the throughput is 1.7 to 3.1x that of pdqsort, 1.7 to 3.5x IPS4o, and 1.2 to 2.3x vqsort. The operational crossover against pdqsort is at $K \approx 1.3 \cdot 10^5$; against ska_sort, $K \approx 8.14 \cdot 10^5$; against vqsort, $K \approx 6.7 \cdot 10^5$; and against IPS4o the curves only converge near $K = N$. Of the five baselines, only vqsort actually overtakes CAFS once the crossover is passed, which makes the vqsort threshold at $K \approx 6.7 \cdot 10^5$ the binding constraint on the operational range of CAFS.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes