DCAI · Jun 17, 2025

ClusterRCA: An End-to-End Approach for Network Fault Localization and Classification for HPC System

arXiv:2506.20673v2 · 1 citation · h-index: 18 · ISSRE
Originality: Incremental advance
AI Analysis

This paper addresses network fault localization for HPC systems, a critical problem made difficult by data heterogeneity. The advance is incremental: it combines existing classifier-based and graph-based approaches.

The paper tackles network failure diagnosis in high-performance computing systems by proposing ClusterRCA, an end-to-end framework that localizes culprit nodes and classifies failure types using multimodal data, achieving high accuracy in experiments with data from a top-tier HPC vendor.

Network failure diagnosis is challenging yet critical for high-performance computing (HPC) systems. Existing methods cannot be directly applied to HPC scenarios because of data heterogeneity and insufficient accuracy. This paper proposes a novel framework, called ClusterRCA, to localize culprit nodes and determine failure types by leveraging multimodal data. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs to analyze the diverse, multimodal data in HPC systems. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches: it constructs a failure graph from the output of the state classifier and then performs a customized random walk on that graph to localize the root cause. Experiments on datasets collected by a top-tier global HPC device vendor show that ClusterRCA achieves high accuracy in diagnosing network failures for HPC systems. ClusterRCA also maintains robust performance across different application scenarios.
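The classifier-then-graph pipeline in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the failure graph is weighted or how the random walk is customized, so everything below is an assumption: the function names `build_failure_graph` and `random_walk_rank`, the anomaly threshold, the restart probability, and the use of a generic random walk with restart in place of the paper's customized walk. The per-pair scores stand in for the state classifier's output on connected NIC pairs.

```python
from collections import defaultdict


def build_failure_graph(pair_scores, threshold=0.5):
    """Keep NIC pairs whose (hypothetical) classifier score reaches the threshold.

    pair_scores maps (node_a, node_b) -> anomaly probability in [0, 1].
    Returns an undirected, weighted adjacency dict.
    """
    graph = defaultdict(dict)
    for (a, b), score in pair_scores.items():
        if score >= threshold:
            graph[a][b] = score
            graph[b][a] = score
    return dict(graph)


def random_walk_rank(graph, restart=0.15, iterations=100):
    """Rank nodes by visit probability under a random walk with restart.

    This is a generic stand-in, not the paper's customized walk.
    """
    nodes = sorted(graph)
    if not nodes:
        return []
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Restart mass is spread uniformly over all nodes.
        nxt = {n: restart / len(nodes) for n in nodes}
        for n in nodes:
            total = sum(graph[n].values())
            # The remaining mass follows edges in proportion to their weight.
            for m, w in graph[n].items():
                nxt[m] += (1.0 - restart) * rank[n] * w / total
        rank = nxt
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative scores only: every link touching n1 looks anomalous.
pair_scores = {
    ("n1", "n2"): 0.90,
    ("n1", "n3"): 0.80,
    ("n1", "n4"): 0.85,
    ("n2", "n3"): 0.10,  # below threshold, so excluded from the graph
}
graph = build_failure_graph(pair_scores)
ranking = random_walk_rank(graph)
print(ranking[0][0])  # top-ranked culprit candidate
```

In this toy input the walk concentrates on `n1` because all anomalous pairs share it. A real system would also need the failure-type classification step, which this sketch omits.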

