DBMar 16

Zero-Cost NDV Estimation from Columnar File Metadata

arXiv:2603.2460639.31 citationsh-index: 1
Predicted impact top 47% in DB · last 90 daysOriginality Incremental advance
AI Analysis

This incremental method improves cost-based query optimization, GPU memory allocation, and data profiling for users of columnar file formats like Apache Parquet.

The paper tackles the problem of estimating the number of distinct values (NDV) in columnar file formats without extra storage or data access by exploiting existing metadata, achieving accurate estimates through two complementary signals and a distribution detector.

We present a method for estimating the number of distinct values (NDV) of a column in columnar file formats, using only existing file metadata--no extra storage, no data access. Two complementary signals are exploited: (1)~inverting the dictionary-encoded storage size equation yields accurate NDV estimates when distinct values are well-spread across row groups; (2)~counting distinct min/max values across row groups and inverting a coupon collector model provides robust estimates for sorted or partitioned data. A lightweight distribution detector routes between the two estimators. While demonstrated on Apache Parquet, the technique generalizes to any format with dictionary encoding and partition-level statistics, such as ORC and F3. Applications include cost-based query optimization, GPU memory allocation, and data profiling.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes