DBMar 16

Partial Partial Aggregates

arXiv:2603.2669817.1h-index: 1
AI Analysis

This is an incremental optimization for distributed database systems, addressing query efficiency issues for users handling complex joins and aggregates.

The paper tackles the problem of inefficient query execution in distributed engines when aggregating after joins, which typically requires multiple network shuffles. It introduces partial partial aggregates (PPA) to push only local compute phases through joins, reducing data without extra shuffles and improving performance in non-FK-PK key configurations.

We introduce partial partial aggregates (PPA), a query optimization technique for distributed engines that pushes only the local compute phase of an aggregate operation through joins. A query that aggregates after a join involves two logical operations, each requiring a network shuffle. Pushing a full aggregate (COMPUTE$\rightarrow$DISTRIBUTE$\rightarrow$MERGE) below the join introduces a third shuffle. In the specific case where the join key is included in the grouping key and the join is FK-PK, the full pushed aggregate can eliminate the top-level aggregate entirely, making it the preferred choice. In all other key configurations, the top aggregate must remain, and the extra shuffle is wasteful. A PPA pushes only COMPUTE, achieving data reduction before the join without the extra shuffle. The technique relies on the distributive property of aggregates and requires accurate NDV estimation for cost-based decisions.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes