DC · LG · May 30, 2018

Predictive Performance Modeling for Distributed Computing using Black-Box Monitoring and Machine Learning

arXiv:1805.11877v1 · 60 citations
Originality: Synthesis-oriented
AI Analysis

The paper addresses the challenge of managing uncertainty about job performance in distributed systems. Because it is a survey, its contribution is incremental: it summarizes existing work rather than proposing a new method.

The paper surveys predictive performance modeling approaches for distributed computing. It focuses on non-intrusive methods that estimate performance metrics such as execution duration and memory usage from past observations. It does not propose new results or report concrete numbers.

In many domains, the previous decade was characterized by increasing data volumes and growing complexity of computational workloads, creating new demands for highly data-parallel computing in distributed systems. Effective operation of these systems is challenging when facing uncertainties about the performance of jobs and tasks under varying resource configurations, e.g., for scheduling and resource allocation. We survey predictive performance modeling (PPM) approaches to estimate performance metrics such as execution duration, required memory or wait times of future jobs and tasks based on past performance observations. We focus on non-intrusive methods, i.e., methods that can be applied to any workload without modification, since the workload is usually a black-box from the perspective of the systems managing the computational infrastructure. We classify and compare sources of performance variation, predicted performance metrics, required training data, use cases, and the underlying prediction techniques. We conclude by identifying several open problems and pressing research needs in the field.
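To make the idea of black-box prediction from past observations concrete, here is a minimal sketch in Python. It is not a method taken from the survey. It assumes, purely for illustration, that the only monitored feature is input size and that runtime grows linearly with it. The names `fit_runtime_model`, `predict_runtime`, and the `history` records are hypothetical.

```python
from statistics import mean


def fit_runtime_model(observations):
    """Fit runtime ~ intercept + slope * input_size by ordinary least squares.

    observations: list of (input_size_gb, runtime_s) pairs from past runs.
    Only externally monitored values are used; the job itself is not modified.
    """
    xs = [x for x, _ in observations]
    ys = [y for _, y in observations]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct input sizes")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict_runtime(model, input_size_gb):
    """Predict the runtime in seconds of a future job from its input size."""
    intercept, slope = model
    return intercept + slope * input_size_gb


# Hypothetical monitoring records: (input size in GB, observed runtime in s).
history = [(1.0, 12.0), (2.0, 22.0), (4.0, 42.0), (8.0, 82.0)]
model = fit_runtime_model(history)
estimate = predict_runtime(model, 16.0)
```

A scheduler could use such an estimate when ordering jobs or sizing a resource request. Real workloads usually need more features (for example resource configuration) and a more flexible model than a straight line.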

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
