LG · DC · Nov 14, 2025

SMART: A Surrogate Model for Predicting Application Runtime in Dragonfly Systems

arXiv:2511.11111v1
h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses workload interference prediction for high-performance computing users, offering an incremental improvement over existing hybrid simulation methods.

The paper tackles the problem of predicting application runtime in Dragonfly high-performance computing systems, which is complicated by workload interference on shared network links, and presents a surrogate model combining graph neural networks and large language models that outperforms existing baselines.

The Dragonfly network, with its high-radix, low-diameter structure, is a leading interconnect in high-performance computing. A major challenge is workload interference on shared network links. Parallel discrete event simulation (PDES) is commonly used to analyze this interference, but high-fidelity PDES is computationally expensive, which makes it impractical for large-scale or real-time scenarios. Hybrid simulation that incorporates data-driven surrogate models offers a promising alternative, especially for forecasting application runtime, a task complicated by the dynamic behavior of network traffic. We present SMART, a surrogate model that combines graph neural networks (GNNs) and large language models (LLMs) to capture both spatial and temporal patterns from port-level router data. SMART outperforms existing statistical and machine learning baselines, enabling accurate runtime prediction and supporting efficient hybrid simulation of Dragonfly networks.
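The abstract gives no architectural detail, so the following is only a minimal sketch of the general "spatial step, then temporal step" idea, not the authors' model. It assumes router features arrive as one snapshot per time step, uses a single round of mean neighbor aggregation as a stand-in for a GNN layer, and replaces the LLM with a plain time average followed by a linear readout. All function names, feature meanings, weights, and the toy topology are hypothetical.

```python
from statistics import mean


def aggregate_neighbors(features, adjacency):
    """Spatial step (stand-in for a GNN layer): each router averages its
    own feature vector with the vectors of its direct neighbors."""
    out = {}
    for router, vec in features.items():
        group = [vec] + [features[n] for n in adjacency.get(router, [])]
        out[router] = [mean(col) for col in zip(*group)]
    return out


def encode_window(snapshots, adjacency):
    """Apply the spatial step to every time step, then pool over routers,
    producing one vector per time step for a temporal model to consume."""
    sequence = []
    for features in snapshots:
        mixed = aggregate_neighbors(features, adjacency)
        sequence.append([mean(col) for col in zip(*mixed.values())])
    return sequence


def predict_runtime(sequence, weights, bias):
    """Temporal step (placeholder for a learned sequence model): average
    the per-step vectors over time and apply a linear readout."""
    pooled = [mean(col) for col in zip(*sequence)]
    return sum(w * x for w, x in zip(weights, pooled)) + bias


# Hypothetical three-router chain; each router has two made-up features.
adjacency = {"r0": ["r1"], "r1": ["r0", "r2"], "r2": ["r1"]}
snapshots = [
    {"r0": [1.0, 0.0], "r1": [3.0, 0.0], "r2": [5.0, 0.0]},
    {"r0": [2.0, 1.0], "r1": [4.0, 1.0], "r2": [6.0, 1.0]},
]
sequence = encode_window(snapshots, adjacency)
estimate = predict_runtime(sequence, weights=[2.0, 1.0], bias=1.0)
print(estimate)
```

In a trained surrogate, the aggregation weights, the temporal model, and the readout would all be learned from simulation traces; the fixed averages here only show how data would flow from port-level snapshots to a single runtime estimate.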

