A Study on the Resource Utilization and User Behavior on Titan Supercomputer
For HPC facility administrators and designers, this work offers a methodology to understand usage patterns and improve future exascale system design, though it is an incremental application of data science methods to a specific dataset.
This study analyzes resource utilization and user behavior on the Titan supercomputer using system logs, GPU traces, and scientific area data, providing insights into seasonality and a predictive model for forecasting utilization.
Understanding how users of HPC facilities behave and how computational resources are requested and utilized is crucial for cluster productivity and essential for designing and constructing future exascale HPC systems. This paper tackles Challenge 4, 'Analyzing Resource Utilization and User Behavior on Titan Supercomputer', of the 2021 Smoky Mountains Conference Data Challenge. Specifically, we dig deeper into the records of Titan to discover patterns and extract relationships. We explore the workload distribution and usage patterns in resource manager system logs, GPU traces, and scientific area information collected from the Titan supercomputer, and we examine how resource utilization and user behavior change over time. Using data science methods such as correlation analysis, clustering, and neural networks, we investigate how projects, jobs, nodes, GPUs, and memory are related. We provide insights into the seasonal usage of resources and a predictive model for forecasting the utilization of the Titan supercomputer. In addition, the described methodology can be readily adopted on other HPC clusters.