Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications
This work addresses the challenge of analyzing parallel program dynamics, a concern for researchers and developers in high-performance computing. It appears incremental, since it combines existing analysis techniques with one new visualization method.
The paper tackles the problem of identifying and characterizing desynchronization patterns in MPI-parallel applications by applying data analytics and machine learning techniques to performance data. It shows that these patterns can be identified from a data set that is much smaller than a full MPI trace.
This paper studies the utility of using data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with a regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time per time step as relevant observables. Using principal component analysis, clustering techniques, correlation functions, and a new "phase space plot," we show how desynchronization patterns (or lack thereof) can be readily identified from a data set that is much smaller than a full MPI trace. Our methods also lead the way towards a more general classification of parallel program dynamics.
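To make the idea of a correlation function over per-process MPI times concrete, the following is a minimal sketch and not the paper's actual method or data. It assumes a hypothetical data layout (one list of MPI times per rank, one entry per time step) and generates synthetic series in which ranks are either in phase or shifted like a travelling wave. The function names, the sinusoidal model, and all parameters are illustrative assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def synthetic_mpi_times(n_ranks, n_steps, wavelength=None, period=20, seed=0):
    """Hypothetical stand-in for measured data: MPI time per step, per rank.

    With wavelength=None every rank oscillates in phase. Otherwise each rank
    is phase-shifted by rank / wavelength, which mimics a travelling wave.
    """
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    data = []
    for rank in range(n_ranks):
        shift = 0.0 if wavelength is None else rank / wavelength
        series = [
            2.0
            + math.sin(2.0 * math.pi * (step / period - shift))
            + rng.gauss(0.0, 0.05)
            for step in range(n_steps)
        ]
        data.append(series)
    return data


def correlation_by_rank(data, ref=0):
    """Correlate the reference rank's time series with every rank's series."""
    return [pearson(data[ref], series) for series in data]


in_sync = correlation_by_rank(synthetic_mpi_times(8, 200))
wave = correlation_by_rank(synthetic_mpi_times(8, 200, wavelength=8))
for rank, (c_sync, c_wave) in enumerate(zip(in_sync, wave)):
    print(f"rank {rank}: in-phase {c_sync:+.2f}, travelling wave {c_wave:+.2f}")
```

In this toy model, correlations that stay close to one across all ranks indicate lockstep behavior, while correlations that fall off and turn negative with rank distance indicate ranks drifting out of phase. The input is only one number per rank per time step, which reflects the abstract's point that such observables are far more compact than a full MPI trace.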