DB · May 9

Elastic Scheduling of Intermittent Query Processing in a Cluster Environment

arXiv:2605.08601 · 3.1
Predicted impact top 96% in DB · last 90 days
Originality Incremental advance
AI Analysis

For stream processing applications with deadlines, this work addresses the need for cost-efficient, elastic scheduling in parallel environments, handling multiple queries and input rate variations.

The paper proposes elastic scheduling schemes for intermittent query processing in a cluster, ensuring deadlines are met while minimizing cost. Experiments on Apache Spark with TPC-H and Yahoo Streaming datasets show significant cost reduction compared to fixed-node or Spark streaming alternatives.

Many applications process a stream of tuples over a window duration, and require the results within a specified deadline after the end of the window. For such scenarios, processing tuples intermittently (in batches) instead of eagerly processing tuples as they arrive significantly reduces the overall cost. Earlier work on intermittent query processing has addressed only fixed environments. In this paper, we propose scheduling schemes for batched processing of tuples in an elastic parallel environment, scaling nodes up or down. Our scheduling schemes ensure that deadlines are met while incurring minimum cost. Our schemes also handle multiple concurrent queries, the arrival of new queries, and input rate variations. We have implemented our schemes on top of Apache Spark, in the AWS EMR environment, and evaluated performance with both TPC-H and Yahoo Streaming datasets. Our experimental results show that our scheduling algorithms significantly outperform alternatives, such as using a fixed set of nodes without elasticity, or using Spark streaming.
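To make the deadline-versus-cost trade-off in the abstract concrete, here is a minimal sketch of how one might size a cluster for a single batch. It is not the paper's scheduling scheme: it assumes a toy model in which throughput scales linearly with node count and each batch pays a fixed startup overhead. The function names, parameters, and numbers are all hypothetical.

```python
import math

def min_nodes_for_deadline(pending_tuples, tuples_per_node_sec,
                           seconds_to_deadline, batch_overhead_sec=0.0,
                           max_nodes=64):
    """Smallest node count whose estimated batch time fits before the deadline.

    Toy assumption (not from the paper): run time is
    batch_overhead_sec + pending_tuples / (tuples_per_node_sec * nodes).
    Returns None when no node count up to max_nodes can meet the deadline.
    """
    budget = seconds_to_deadline - batch_overhead_sec
    if budget <= 0:
        return None  # the overhead alone already misses the deadline
    needed = max(1, math.ceil(pending_tuples / (tuples_per_node_sec * budget)))
    return needed if needed <= max_nodes else None

def node_seconds(pending_tuples, tuples_per_node_sec, nodes,
                 batch_overhead_sec=0.0):
    """Estimated cost of one batch in node-seconds under the same toy model."""
    run_time = batch_overhead_sec + pending_tuples / (tuples_per_node_sec * nodes)
    return nodes * run_time

# Hypothetical workload: 1M buffered tuples, 10k tuples/s per node,
# 30 s until the deadline, 5 s fixed startup overhead per batch.
nodes = min_nodes_for_deadline(1_000_000, 10_000, 30, 5)
cost = node_seconds(1_000_000, 10_000, nodes, 5)
```

Under this model the fixed overhead is paid once per node per batch, so adding nodes beyond the minimum only raises the node-second cost. That is why the sketch picks the smallest count that still meets the deadline. A real scheduler would also have to account for node acquisition delays, billing granularity, and non-linear scaling.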

Code Implementations · 1 repo