Enel: Context-Aware Dynamic Scaling of Distributed Dataflow Jobs using Graph Propagation
This work addresses runtime management for data analytics in distributed systems, offering an incremental improvement over existing scaling methods.
The paper tackles dynamic scaling of distributed dataflow jobs so that they meet runtime targets despite performance variance. It presents Enel, which models a job as an attributed graph and uses message propagation over that graph to derive rescaling decisions. In an evaluation with four iterative Spark jobs, Enel identified effective rescaling actions, for instance in reaction to node failures.
Distributed dataflow systems like Spark and Flink enable the use of clusters for scalable data analytics. While runtime prediction models can be used to initially select appropriate cluster resources given target runtimes, the actual runtime performance of dataflow jobs depends on several factors and varies over time. Yet, in many situations, dynamic scaling can be used to meet formulated runtime targets despite significant performance variance. This paper presents Enel, a novel dynamic scaling approach that uses message propagation on an attributed graph to model dataflow jobs and, thus, allows for deriving effective rescaling decisions. For this, Enel incorporates descriptive properties that capture the respective execution context, considers statistics from individual dataflow tasks, and propagates predictions through the job graph to eventually find an optimized new scale-out. Our evaluation of Enel with four iterative Spark jobs shows that our approach is able to identify effective rescaling actions, reacting for instance to node failures, and can be reused across different execution contexts.
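To make the idea of propagating predictions through a job graph concrete, the following is a minimal sketch and not Enel's actual model. Enel derives its predictions from context properties and task statistics, whereas this sketch substitutes a fixed, hypothetical cost formula (a serial part plus a parallel part divided by the scale-out). The graph, the stage names, the cost values, and the functions `stage_runtime`, `predicted_job_runtime`, and `choose_scale_out` are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each stage maps to the set of its predecessor stages.
JOB_GRAPH = {
    "read": set(),
    "map": {"read"},
    "join": {"read"},
    "reduce": {"map", "join"},
}

# Hypothetical per-stage costs in seconds: (serial part, parallelizable part).
STAGE_COST = {
    "read": (2.0, 40.0),
    "map": (1.0, 80.0),
    "join": (3.0, 120.0),
    "reduce": (2.0, 60.0),
}


def stage_runtime(stage, scale_out):
    """Assumed stand-in for a learned per-stage runtime prediction."""
    serial, parallel = STAGE_COST[stage]
    return serial + parallel / scale_out


def predicted_job_runtime(graph, scale_out):
    """Propagate predicted finish times along the graph in topological order."""
    finish = {}
    for stage in TopologicalSorter(graph).static_order():
        # A stage starts once all of its predecessors have finished.
        start = max((finish[p] for p in graph[stage]), default=0.0)
        finish[stage] = start + stage_runtime(stage, scale_out)
    return max(finish.values())


def choose_scale_out(graph, target_runtime, candidates):
    """Return the smallest candidate scale-out predicted to meet the target.

    Falls back to the largest candidate if none meets the target.
    """
    for scale_out in sorted(candidates):
        if predicted_job_runtime(graph, scale_out) <= target_runtime:
            return scale_out
    return max(candidates)


if __name__ == "__main__":
    best = choose_scale_out(JOB_GRAPH, target_runtime=40.0, candidates=range(2, 17))
    print(best, predicted_job_runtime(JOB_GRAPH, best))
```

Choosing the smallest scale-out that satisfies the target is one plausible objective under these assumptions; a real system would re-evaluate this choice at runtime as new task statistics arrive, for example after a node failure.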