A Multi-Level, Multi-Scale Visual Analytics Approach to Assessment of Multifidelity HPC Systems
This work addresses the challenge of improving robustness and reliability in high-performance computing facilities by enabling better log analysis, though it appears incremental because it builds on existing techniques such as mrDMD.
The paper tackles the problem of monitoring and interpreting complex supercomputing systems by developing a holistic analytical system that processes massive multi-fidelity log data. It uses an improved multiresolution dynamic mode decomposition (mrDMD) together with visual analytics to extract usage and error patterns at varying resolutions, and demonstrates the approach with two use scenarios on a Cray XC40 supercomputer.
The ability to monitor and interpret hardware system events and behaviors is crucial to improving the robustness and reliability of these systems, especially in a supercomputing facility. The growing complexity and scale of these systems demand an increase in monitoring data collected at multiple fidelity levels and varying temporal resolutions. In this work, we aim to build a holistic analytical system that helps make sense of such massive data, mainly the hardware logs, job logs, and environment logs collected from disparate subsystems and components of a supercomputer system. This end-to-end log analysis system, coupled with visual analytics support, allows users to promptly extract supercomputer usage and error patterns at varying temporal and spatial resolutions. We use multiresolution dynamic mode decomposition (mrDMD), a technique that represents high-dimensional data as correlated spatiotemporal variation patterns, or modes, to extract variation patterns isolated at specified frequencies. Our improvements to the mrDMD algorithm help promptly reveal useful information in the massive environment log dataset, which is then associated with the processed hardware and job log datasets using our visual analytics system. Furthermore, our system can identify usage and error patterns filtered at the user, project, and subcomponent levels. We demonstrate the effectiveness of our approach with two use scenarios on the Cray XC40 supercomputer.
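To make the idea of isolating variation patterns at specified frequencies concrete, the following is a minimal sketch of standard exact DMD followed by a slow-mode filter, which is the basic building block that mrDMD applies recursively over progressively shorter time windows. It is not the improved algorithm described in this work. The function names (`dmd`, `slow_mode_mask`, `make_demo_data`), the rank, the cutoff, and the synthetic sensor-by-time matrix standing in for environment log data are all assumptions made for illustration.

```python
import numpy as np


def dmd(X, dt, rank):
    """Exact DMD of a snapshot matrix X (rows: sensors, columns: time steps).

    Returns the DMD modes, the discrete-time eigenvalues, and the
    continuous-time eigenvalues omega = log(lambda) / dt.
    """
    X1, X2 = X[:, :-1], X[:, 1:]            # time-shifted snapshot pairs
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    V = Vh.conj().T
    # Reduced linear operator: U^H X2 V S^{-1}
    A_tilde = (U.conj().T @ X2 @ V) / s
    eigvals, W = np.linalg.eig(A_tilde)
    # Exact DMD modes: X2 V S^{-1} W
    modes = ((X2 @ V) / s) @ W
    omega = np.log(eigvals.astype(complex)) / dt
    return modes, eigvals, omega


def slow_mode_mask(omega, cutoff_hz):
    """Boolean mask of modes whose oscillation frequency is at or below cutoff_hz."""
    return np.abs(omega.imag) / (2 * np.pi) <= cutoff_hz


def make_demo_data():
    """Synthetic stand-in for sensor readings: a 0.5 Hz and a 10 Hz component."""
    dt = 0.01
    t = np.arange(0.0, 2.0, dt)
    rng = np.random.default_rng(0)
    patterns = rng.standard_normal((8, 4))  # hypothetical spatial patterns, 8 sensors
    signals = np.vstack([
        np.cos(2 * np.pi * 0.5 * t), np.sin(2 * np.pi * 0.5 * t),
        np.cos(2 * np.pi * 10.0 * t), np.sin(2 * np.pi * 10.0 * t),
    ])
    return patterns @ signals, dt


if __name__ == "__main__":
    X, dt = make_demo_data()
    modes, eigvals, omega = dmd(X, dt, rank=4)
    mask = slow_mode_mask(omega, cutoff_hz=1.0)
    print("mode frequencies (Hz):", np.abs(omega.imag) / (2 * np.pi))
    print("slow modes kept:", int(mask.sum()))
```

In a multiresolution scheme, the reconstruction from the slow modes would be subtracted from the data, the time window split in half, and the same step repeated on each half, so that faster patterns are resolved at finer temporal scales.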