DC AI LG | Nov 2, 2024

Data movement limits to frontier model training

arXiv:2411.01137v25.95 citations | h-index: 7

Originality: Incremental advance

AI Analysis

This paper identifies data movement as a potential barrier to scaling large AI models, which matters for researchers and companies pushing the frontiers of model size and performance.

The authors build a theoretical model of distributed training and use it to locate data movement bottlenecks in dense and sparse training runs. Under their baseline assumptions and a three-month training duration, runs exceeding about 10^28 FLOP would see significantly lower hardware utilization, and runs exceeding about 10^31 FLOP would be infeasible even at low utilization. They note that more aggressive batch size scaling or shorter and fatter model shapes, if achievable, could permit much larger training runs.

We present a theoretical model of distributed training, and use it to analyze how far dense and sparse training runs can be scaled. Under our baseline assumptions, given a three month training duration, data movement bottlenecks begin to significantly lower hardware utilization for training runs exceeding about $10^{28}$ FLOP, two orders of magnitude above the largest training run to date, suggesting the arrival of fundamental barriers to scaling in three years given recent rates of growth. A training run exceeding about $10^{31}$ FLOP is infeasible even at low utilization. However, more aggressive batch size scaling and/or shorter and fatter model shapes, if achievable, have the potential to permit much larger training runs.
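The "three years" figure in the abstract follows from simple growth arithmetic: closing a gap of two orders of magnitude at a constant annual multiplier. The sketch below illustrates that calculation only. The function name `years_until` and the annual growth multipliers are illustrative assumptions, not values taken from the paper. The largest run to date is inferred as two orders of magnitude below the $10^{28}$ FLOP threshold, as the abstract states.

```python
import math


def years_until(target_flop: float, current_flop: float, growth_per_year: float) -> float:
    """Years for training compute to grow from current_flop to target_flop,
    assuming a constant multiplicative growth rate per year."""
    if growth_per_year <= 1.0:
        raise ValueError("growth_per_year must be greater than 1")
    if target_flop <= current_flop:
        return 0.0
    return math.log(target_flop / current_flop) / math.log(growth_per_year)


# Thresholds quoted in the abstract.
UTILIZATION_THRESHOLD_FLOP = 1e28  # utilization begins to drop significantly
INFEASIBLE_FLOP = 1e31             # infeasible even at low utilization

# The abstract places the largest run to date two orders of magnitude
# below the 1e28 threshold.
LARGEST_RUN_FLOP = UTILIZATION_THRESHOLD_FLOP / 100

# ASSUMPTION: these annual growth multipliers are illustrative only;
# the abstract says "recent rates of growth" without giving a number.
ASSUMED_GROWTH_RATES = (3.0, 4.0, 5.0)

if __name__ == "__main__":
    for growth in ASSUMED_GROWTH_RATES:
        to_threshold = years_until(UTILIZATION_THRESHOLD_FLOP, LARGEST_RUN_FLOP, growth)
        to_infeasible = years_until(INFEASIBLE_FLOP, LARGEST_RUN_FLOP, growth)
        print(f"{growth:.0f}x/year: {to_threshold:.1f} years to 1e28 FLOP, "
              f"{to_infeasible:.1f} years to 1e31 FLOP")
```

Under these assumed rates, a multiplier of roughly 4x to 5x per year puts the $10^{28}$ FLOP threshold about three years out, which is consistent with the abstract's estimate.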
