Routing for Large ML Models
This work addresses network bottlenecks for researchers and engineers training large-scale ML models, offering an incremental optimization approach.
The paper tackles inefficient data communication during large-model training by proposing an algorithmic framework that quantifies network-wide efficiency and periodically optimizes routing against that metric, with the aim of improving overall training efficiency.
Training large language models (LLMs) and other large machine learning models involves repeated communication of large volumes of data across a data center network. The communication patterns induced by these training processes exhibit high regularity and persistence, giving rise to significant opportunities for optimizing how flows are routed across the network. We present an algorithmic framework for \textit{quantifying} network-wide efficiency in the context of training LLMs (and other large-scale ML models), and for periodically \textit{optimizing} routing with respect to this global metric.