Claude Brisson

34.7 DB Mar 16

Zero-Cost NDV Estimation from Columnar File Metadata

Claude Brisson

We present a method for estimating the number of distinct values (NDV) of a column in columnar file formats using only existing file metadata, with no extra storage and no data access. Two complementary signals are exploited: (1) inverting the dictionary-encoded storage size equation yields accurate NDV estimates when distinct values are well-spread across row groups; (2) counting distinct min/max values across row groups and inverting a coupon collector model provides robust estimates for sorted or partitioned data. A lightweight distribution detector routes between the two estimators. While demonstrated on Apache Parquet, the technique generalizes to any format with dictionary encoding and partition-level statistics, such as ORC and F3. Applications include cost-based query optimization, GPU memory allocation, and data profiling.
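The second signal can be illustrated with a minimal sketch. The abstract does not give the exact model, so the code below assumes the textbook coupon collector form: after m uniform draws from N equally likely values, the expected number of distinct values seen is N(1 - (1 - 1/N)^m). The function names, the bisection approach, and the choice of treating each row-group min and max as one draw are illustrative assumptions, not the paper's implementation.

```python
def expected_distinct(ndv: float, draws: int) -> float:
    """Expected distinct values seen after `draws` uniform draws from `ndv` values."""
    return ndv * (1.0 - (1.0 - 1.0 / ndv) ** draws)


def estimate_ndv(observed_distinct: float, draws: int, max_ndv: float = 1e12) -> float:
    """Invert the coupon collector expectation by bisection (hypothetical helper).

    observed_distinct: number of distinct min/max values seen across row groups.
    draws: number of min/max statistics inspected (assumed: 2 per row group).
    """
    if observed_distinct >= draws:
        # Every statistic was a new value: the model gives no upper bound.
        return float("inf")
    lo, hi = float(observed_distinct), max_ndv
    for _ in range(200):
        mid = (lo + hi) / 2.0
        # expected_distinct is increasing in ndv, so bisection applies.
        if expected_distinct(mid, draws) < observed_distinct:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical example: 50 row groups, so 100 min/max statistics,
# of which 43 are distinct.
estimate = estimate_ndv(43, 100)
```

When every statistic is distinct, the sketch returns infinity rather than guessing, which is one reason a second estimator (here, the dictionary-size signal) is useful.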

12.3 DB Mar 16

Partial Partial Aggregates

Claude Brisson

We introduce partial partial aggregates (PPA), a query optimization technique for distributed engines that pushes only the local compute phase of an aggregate operation through joins. A query that aggregates after a join involves two logical operations, each requiring a network shuffle. Pushing a full aggregate (COMPUTE$\rightarrow$DISTRIBUTE$\rightarrow$MERGE) below the join introduces a third shuffle. In the specific case where the join key is included in the grouping key and the join is FK-PK, the full pushed aggregate can eliminate the top-level aggregate entirely, making it the preferred choice. In all other key configurations, the top aggregate must remain, and the extra shuffle is wasteful. A PPA pushes only COMPUTE, achieving data reduction before the join without the extra shuffle. The technique relies on the distributive property of aggregates and requires accurate NDV estimation for cost-based decisions.
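The idea of pushing only COMPUTE below the join can be sketched on a toy example. The code below assumes a SUM aggregate, an FK-PK join, and a grouping key (region) that does not contain the join key, so the top-level aggregate must remain. Partitions are modeled as plain Python lists, and all table contents and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table, split into two partitions of (join_key, amount) rows.
fact_partitions = [
    [("k1", 10), ("k1", 5), ("k2", 7)],
    [("k1", 1), ("k2", 3), ("k2", 4)],
]
# Hypothetical dimension table (PK side): join_key -> region.
dim = {"k1": "EU", "k2": "US"}


def local_compute(rows):
    """COMPUTE phase only: pre-aggregate within one partition, no shuffle."""
    acc = defaultdict(int)
    for key, amount in rows:
        acc[key] += amount
    return list(acc.items())


def join_then_aggregate(partitions):
    """Join each row with dim, then run the top-level SUM grouped by region."""
    totals = defaultdict(int)
    rows_joined = 0
    for rows in partitions:
        for key, amount in rows:
            totals[dim[key]] += amount
            rows_joined += 1
    return dict(totals), rows_joined


# Baseline: aggregate after the join.
baseline, baseline_rows = join_then_aggregate(fact_partitions)

# PPA: local COMPUTE in each partition first, then the same join and top aggregate.
reduced = [local_compute(rows) for rows in fact_partitions]
ppa, ppa_rows = join_then_aggregate(reduced)

print(baseline, baseline_rows)
print(ppa, ppa_rows)
```

Both plans produce the same totals because SUM is distributive, while the pre-aggregated plan sends fewer rows into the join. Whether the reduction pays off depends on how many distinct join keys each partition holds, which is where NDV estimates enter the cost decision.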

Claude Brisson

2 Papers