A Quick and Exact Method for Distributed Quantile Computation
For practitioners needing exact quantiles in large-scale data analytics, GK Select provides a practical alternative to expensive global sorts without sacrificing accuracy.
GK Select enables exact quantile computation in Spark with latency comparable to approximate methods, outperforming Spark's full sort by ~10.5x on 10^9 values across 120 partitions.
Quantile computation is a core primitive in large-scale data analytics. In Spark, practitioners typically rely on the Greenwald-Khanna (GK) Sketch, an approximate method. When exact quantiles are required, the default option is an expensive global sort. We present GK Select, an exact Spark algorithm that avoids full-data shuffles and completes in a constant number of actions. GK Select leverages GK Sketch to identify a near-target pivot, extracts all values within the error bound around this pivot in each partition in linear time, and then tree-reduces the resulting candidate sets. We show analytically that GK Select matches the executor-side time complexity of GK Sketch while returning the exact quantile. Empirically, GK Select achieves sketch-level latency and outperforms Spark's full sort by approximately 10.5x on 10^9 values across 120 partitions on a 30-core AWS EMR cluster.