DB | Mar 16

Workload-Aware Incremental Reclustering in Cloud Data Warehouses

arXiv:2602.23289 | 10.3 | h-index: 6
AI Analysis

This work addresses the need for flexible, cost-effective data management in dynamic cloud environments with continuous data ingestion and evolving workloads. It is an incremental improvement over existing automatic clustering approaches.

The paper addresses how to maintain data clustering in cloud data warehouses so that queries can prune data efficiently. It proposes WAIR, a workload-aware algorithm that reclusters only the most critical boundary micro-partitions, achieving near-optimal query performance at significantly lower reclustering cost.

Modern cloud data warehouses store data in micro-partitions and rely on metadata (e.g., zonemaps) for efficient data pruning during query processing. Maintaining data clustering in a large-scale table is crucial for effective data pruning. Existing automatic clustering approaches lack the flexibility required in dynamic cloud environments with continuous data ingestion and evolving workloads. This paper advocates a clean separation between reclustering policy and clustering-key selection. We introduce the concept of boundary micro-partitions that sit on the boundary of query ranges. We then present WAIR, a workload-aware algorithm to identify and recluster only the boundary micro-partitions most critical for pruning efficiency. WAIR achieves near-optimal query performance (with respect to fully sorted table layouts) but incurs significantly lower reclustering cost, with a theoretical upper bound. We further implement the algorithm in a prototype reclustering service and evaluate it on standard benchmarks (TPC-H, DSB) and a real-world workload. Results show that WAIR improves query performance and reduces the overall cost compared to existing solutions.
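
To make the zonemap and boundary ideas in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: the names `MicroPartition`, `classify`, and `boundary_counts` are hypothetical, a zonemap is reduced to a single min/max pair on one integer clustering key, and "boundary" is read as "overlaps the query range without being fully contained in it". WAIR's actual criticality ranking and its cost bound are not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MicroPartition:
    """Hypothetical micro-partition with a min/max zonemap on one key."""
    pid: int
    min_key: int  # zonemap minimum of the clustering key
    max_key: int  # zonemap maximum of the clustering key


def classify(p: MicroPartition, lo: int, hi: int) -> str:
    """Classify a partition against the inclusive query range [lo, hi].

    'pruned'   : zonemap proves no row can match, so the scan skips it.
    'inner'    : zonemap lies entirely inside the query range.
    'boundary' : zonemap straddles an edge of the range (assumed reading
                 of the abstract's "boundary micro-partitions").
    """
    if p.max_key < lo or p.min_key > hi:
        return "pruned"
    if lo <= p.min_key and p.max_key <= hi:
        return "inner"
    return "boundary"


def boundary_counts(
    partitions: List[MicroPartition],
    queries: List[Tuple[int, int]],
) -> Dict[int, int]:
    """Count, per partition, how many queries see it as a boundary partition.

    A higher count is used here as a stand-in for "more critical for
    pruning"; this scoring rule is an assumption for illustration only.
    """
    counts = {p.pid: 0 for p in partitions}
    for lo, hi in queries:
        for p in partitions:
            if classify(p, lo, hi) == "boundary":
                counts[p.pid] += 1
    return counts
```

In this toy model, a partition whose key range overlaps its neighbours' ranges tends to collect boundary hits from many queries, which is the kind of partition a workload-aware policy would want to recluster first, while partitions that are always pruned or always fully inside a range are left alone.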
