DBLGJun 19, 2025

Data-Agnostic Cardinality Learning from Imperfect Workloads

arXiv:2506.16007v11 citationsh-index: 42Proc VLDB Endow
Originality Highly original
AI Analysis

This addresses a practical problem for database systems where organizational policies limit data access, offering a robust solution for query optimization in such scenarios.

The paper tackles the problem of cardinality estimation under real-world constraints where data access is restricted and training workloads are imperfect, presenting GRASP which outperforms existing query-driven models and achieves performance comparable to traditional data-based approaches on benchmarks.

Cardinality estimation (CardEst) is a critical aspect of query optimization. Traditionally, it leverages statistics built directly over the data. However, organizational policies (e.g., regulatory compliance) may restrict global data access. Fortunately, query-driven cardinality estimation can learn CardEst models using query workloads. However, existing query-driven models often require access to data or summaries for best performance, and they assume perfect training workloads with complete and balanced join templates (or join graphs). Such assumptions rarely hold in real-world scenarios, in which join templates are incomplete and imbalanced. We present GRASP, a data-agnostic cardinality learning system designed to work under these real-world constraints. GRASP's compositional design generalizes to unseen join templates and is robust to join template imbalance. It also introduces a new per-table CardEst model that handles value distribution shifts for range predicates, and a novel learned count sketch model that captures join correlations across base relations. Across three database instances, we demonstrate that GRASP consistently outperforms existing query-driven models on imperfect workloads, both in terms of estimation accuracy and query latency. Remarkably, GRASP achieves performance comparable to, or even surpassing, traditional approaches built over the underlying data on the complex CEB-IMDb-full benchmark -- despite operating without any data access and using only 10% of all possible join templates.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes