DC AI LG Sep 26, 2025

Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM

Biyao Zhang, Mingkai Zheng, Debargha Ganguly, Xuecen Zhang, Vikash Singh, Vipin Chaudhary, Zhao Zhang

arXiv:2509.22832v14 citations h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses a critical problem for researchers and engineers in high-performance computing by enabling efficient prediction of training times for distributed deep learning of LLMs, though it is incremental as it builds on existing modeling approaches.

The paper tackles the challenge of predicting end-to-end training time for large language models distributed across hundreds of GPUs by developing a fine-grained performance modeling framework, achieving low average prediction errors of 4.98% on Perlmutter and 9.38% on Vista for models up to 20B parameters across 128 GPUs.

Training Large Language Models (LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion parameter models distributed across hundreds of GPUs remains challenging due to complex interactions between transformer components, parallelism strategies (data, model, pipeline, tensor), and multi-tier communication. Learned models require costly sampling, while analytical models often struggle with real-world network and hardware complexities. We address this by decomposing LLMs into core computational primitives and modeling them with: (1) operator-level decomposition for fine-grained analysis; (2) lightweight sampling-based, hardware-aware prediction models for key operations; (3) an end-to-end prediction system integrating these components across complex parallelization strategies. Crucially, our methodology has been validated on two large-scale HPC systems. Our framework achieves low average prediction errors, 4.98% on Perlmutter (A100) and 9.38% on Vista (GH200), for models up to 20B parameters across 128 GPUs. Importantly, it runs entirely on CPUs, enabling rapid iteration over hardware configurations and training strategies without costly on-cluster experimentation.
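To make the idea of operator-level decomposition concrete, the following is a minimal sketch of how per-operator time estimates and a communication term might be composed into a single step-time prediction on a CPU. It is not the paper's implementation: a simple roofline bound stands in for the sampling-based, hardware-aware predictors described in the abstract, the ring all-reduce cost is the textbook formula, and every name here (`Op`, `op_time`, `allreduce_time`, `step_time`, `relative_error`) and every parameter is hypothetical. It also assumes no overlap between compute and communication and treats prediction error as absolute relative error, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """One computational primitive (hypothetical cost description)."""
    name: str
    flops: float        # floating-point operations per call
    bytes_moved: float  # memory traffic per call, in bytes
    calls: int = 1      # how many times the op runs per training step


def op_time(op: Op, peak_flops: float, mem_bw: float) -> float:
    """Roofline-style stand-in: the op is bound by compute or by memory."""
    return op.calls * max(op.flops / peak_flops, op.bytes_moved / mem_bw)


def allreduce_time(nbytes: float, n_gpus: int, link_bw: float, latency: float) -> float:
    """Textbook ring all-reduce: 2(p-1) steps, each moving nbytes/p bytes."""
    if n_gpus <= 1:
        return 0.0
    steps = 2 * (n_gpus - 1)
    return steps * (latency + (nbytes / n_gpus) / link_bw)


def step_time(ops, peak_flops, mem_bw, grad_bytes, n_gpus, link_bw, latency):
    """Sum per-operator estimates and add gradient synchronization.

    Assumes compute and communication do not overlap (a simplification).
    """
    compute = sum(op_time(op, peak_flops, mem_bw) for op in ops)
    comm = allreduce_time(grad_bytes, n_gpus, link_bw, latency)
    return compute + comm


def relative_error(predicted: float, measured: float) -> float:
    """Absolute relative error of a prediction against a measurement."""
    return abs(predicted - measured) / measured
```

Because nothing in the sketch touches a GPU, sweeping `n_gpus`, bandwidths, or the operator list is only a matter of calling `step_time` again with different arguments, which is the kind of rapid iteration the abstract attributes to a CPU-only predictor.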
