DC · Apr 12

PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training

Alicia Golden, Michael Kuchnik, Samuel Hsia, Zachary DeVito, Gu-Yeon Wei, David Brooks, Carole-Jean Wu

arXiv:2510.15596 · 92.51 citations · h-index: 9

AI Analysis

For engineers and researchers scaling distributed training to tens of thousands of GPUs, PRISM provides a principled method to account for stochastic runtime variations, improving system efficiency.

The paper addresses performance variability in large-scale distributed training, showing 9% GPU time variability at 64,000+ GPU scale, and introduces PRISM, a probabilistic performance modeling framework that quantifies training time guarantees and enables variability-aware optimization.

Large model training beyond tens of thousands of GPUs is uncharted territory. At such scales, disruptions to the training process are not a matter of if, but a matter of when -- a stochastic process degrading training productivity. Dynamic runtime variation will become increasingly frequent as training scales up and as GPUs are operated in increasingly power-limited and thermally-stressed environments. At the 64,000+ GPU scale, we already observe 9% GPU time variability for frontier foundation model training. Motivated by our analysis and the large design space around performance variability, we present PRISM -- a performance modeling framework that captures the stochastic nature of large-scale distributed training. The core of PRISM is a statistical method that quantifies probabilistic guarantees on training time. Using PRISM, we explore the design and optimization space of distributed training, enabling principled, variability-aware decisions that improve performance and system efficiency at scale.
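To make the idea of a probabilistic guarantee on training time concrete, the sketch below runs a small Monte Carlo simulation. It is not PRISM's method, which the abstract does not detail; it is an illustrative toy model under stated assumptions: training is fully synchronous, each worker independently hits a slowdown on a given step with a fixed probability, and one slow worker stretches the whole step by a fixed factor. All function names and parameter values (`per_worker_prob`, `slowdown_factor`, the step count, and so on) are hypothetical.

```python
import math
import random


def step_slow_probability(per_worker_prob: float, num_workers: int) -> float:
    """Probability that at least one of num_workers independent workers is slow."""
    return 1.0 - (1.0 - per_worker_prob) ** num_workers


def simulate_run(num_steps, slow_prob, base_step_s, slowdown_factor, rng):
    """Total time of one simulated run; a slow step costs base_step_s * slowdown_factor."""
    slow_steps = sum(1 for _ in range(num_steps) if rng.random() < slow_prob)
    fast_steps = num_steps - slow_steps
    return base_step_s * (fast_steps + slowdown_factor * slow_steps)


def empirical_quantile(samples, q):
    """Smallest sample value with at least a fraction q of the samples at or below it."""
    ordered = sorted(samples)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


# Illustrative (assumed) parameters, not measurements from the paper.
rng = random.Random(0)
q_slow = step_slow_probability(per_worker_prob=1e-6, num_workers=64_000)
runs = [simulate_run(1_000, q_slow, 1.0, 1.5, rng) for _ in range(500)]

ideal = 1_000 * 1.0                      # time with no slowdowns at all
p50 = empirical_quantile(runs, 0.50)     # median simulated training time
p95 = empirical_quantile(runs, 0.95)     # "finishes within this time in 95% of runs"

print(f"P(step is slow) = {q_slow:.4f}")
print(f"ideal = {ideal:.1f}s, p50 = {p50:.1f}s, p95 = {p95:.1f}s")
```

The quantity of interest here is the high quantile (`p95`) rather than the mean: a statement such as "training finishes within T seconds with 95% probability" is the kind of guarantee a variability-aware model can offer, while a deterministic model only yields the `ideal` figure.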
