DCLG Oct 21, 2025

A Distributed Framework for Causal Modeling of Performance Variability in GPU Traces

arXiv:2510.18300v1
h-index: 3
Originality: Incremental advance
AI Analysis

This work targets inefficiencies in performance analysis faced by HPC researchers and engineers. Its contribution is incremental: it builds on existing causal graph methods and adds parallelization.

The paper tackles the challenge of analyzing large-scale GPU traces for performance bottlenecks in HPC. It develops a distributed framework that partitions trace data and processes the partitions concurrently, and it reports a 67% improvement in scalability.

Large-scale GPU traces play a critical role in identifying performance bottlenecks within heterogeneous High-Performance Computing (HPC) architectures. However, the sheer volume and complexity of the data in even a single trace make performance analysis both computationally expensive and time-consuming. To address this challenge, we present an end-to-end parallel performance analysis framework designed to handle multiple large-scale GPU traces efficiently. Our proposed framework partitions and processes trace data concurrently, and employs causal graph methods and a parallel coordinates chart to expose performance variability and dependencies across execution flows. Experimental results demonstrate a 67% improvement in scalability, highlighting the effectiveness of our pipeline for analyzing multiple traces independently.
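The idea of processing traces independently and concurrently can be illustrated with a minimal sketch. This is not the paper's implementation: the trace format (a list of `(kernel_name, duration_us)` records), the function names `summarize_trace` and `analyze_traces`, and the choice of per-kernel mean and standard deviation as the variability measure are all assumptions made for illustration. A thread pool is used to keep the example short; for CPU-bound analysis, a process pool or a distributed scheduler would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev


def summarize_trace(trace):
    """Summarize one trace: per-kernel mean, std dev, and call count.

    `trace` is a hypothetical format: a list of (kernel_name, duration_us).
    """
    durations_by_kernel = {}
    for kernel, duration_us in trace:
        durations_by_kernel.setdefault(kernel, []).append(duration_us)
    return {
        kernel: {
            "mean_us": mean(durations),
            "std_us": pstdev(durations),  # population standard deviation
            "count": len(durations),
        }
        for kernel, durations in durations_by_kernel.items()
    }


def analyze_traces(traces, max_workers=4):
    """Summarize each trace independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(summarize_trace, traces))


# Synthetic data standing in for two GPU traces (values are made up).
traces = [
    [("gemm", 10.0), ("gemm", 14.0), ("copy", 3.0)],
    [("gemm", 11.0), ("copy", 3.0), ("copy", 5.0)],
]
summaries = analyze_traces(traces)
for index, summary in enumerate(summaries):
    print(index, summary)
```

Because each trace is summarized without reference to the others, the work divides cleanly across workers, which is the property the abstract relies on when it describes analyzing multiple traces independently.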

