DC, SE | Aug 17, 2015

A Comprehensive Perspective on Pilot-Job Systems

arXiv:1508.04180v3 | 82 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses a foundational problem for researchers and practitioners in high-performance and scientific computing by clarifying terminology and architecture to improve system robustness and interoperability.

This paper tackles the lack of standardized definitions and shared understanding of Pilot-Job systems in distributed scientific computing, where a single experiment can process up to 1 million jobs a day. It provides a comprehensive analysis that outlines the Pilot abstraction, its components, and its properties through case studies of seven implementations.

Pilot-Job systems play an important role in supporting distributed scientific computing. The Open Science Grid communities use them to consume more than 700 million CPU hours a year, and the ATLAS experiment uses them to process up to 1 million jobs a day on the Worldwide LHC Computing Grid. With the increasing importance of task-level parallelism in high-performance computing, Pilot-Job systems are also being adopted beyond their traditional domains. Notwithstanding their growing impact on scientific research, there is no agreed-upon definition of a Pilot-Job system and no clear understanding of the underlying abstraction and paradigm. Pilot-Job implementations have proliferated with no shared best practices, no open interfaces, and little interoperability. Ultimately, this hinders the realization of the full impact of Pilot-Jobs by limiting their robustness, portability, and maintainability. This paper offers a comprehensive analysis of Pilot-Job systems, critically assessing their motivations, evolution, properties, and implementation. The three main contributions of this paper are: (i) an analysis of the motivations and evolution of Pilot-Job systems; (ii) an outline of the Pilot abstraction, its distinguishing logical components and functionalities, its terminology, and its architecture pattern; and (iii) a description of the core and auxiliary properties of Pilot-Job systems and an analysis of seven exemplar Pilot-Job implementations. Together, these contributions illustrate the Pilot paradigm, its generality, and how it helps to address some challenges in distributed scientific computing.

