NILG
Mar 7, 2025

Routing for Large ML Models

arXiv:2503.05324v1 | h-index: 14 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses network bottlenecks for researchers and engineers training large-scale ML models, offering an incremental optimization approach.

The paper tackles the problem of inefficient data communication during large model training by proposing an algorithmic framework to quantify and periodically optimize network routing, which improves overall training efficiency.

Training large language models (LLMs), and other large machine learning models, involves repeated communication of large volumes of data across a data center network. The communication patterns induced by these training processes exhibit high regularity and persistence, giving rise to significant opportunities for optimizing the manner in which flows are routed across the network. We present an algorithmic framework for quantifying network-wide efficiency in the context of training LLMs (and other large-scale ML models), and for periodically optimizing routing with respect to this global metric.
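To make the "quantify, then periodically optimize" idea concrete, here is a minimal sketch. It is not the paper's framework: the abstract does not define its efficiency metric or its optimizer, so the sketch assumes the metric is the maximum load on any link and uses a simple greedy re-routing pass. The topology, the flow names, and the functions `link_loads`, `max_link_load`, and `reoptimize` are all hypothetical.

```python
from collections import defaultdict


def link_loads(flows, routing):
    """Sum the demand each flow places on every directed link of its path."""
    loads = defaultdict(float)
    for flow_id, demand in flows.items():
        path = routing[flow_id]
        for link in zip(path, path[1:]):
            loads[link] += demand
    return loads


def max_link_load(flows, routing):
    """Assumed global metric: the load on the most congested link."""
    return max(link_loads(flows, routing).values(), default=0.0)


def reoptimize(flows, candidates, routing):
    """One greedy pass: move each flow to the candidate path that most
    reduces the assumed metric, keeping its current path otherwise."""
    routing = dict(routing)
    for flow_id in flows:
        best_path = routing[flow_id]
        best_cost = max_link_load(flows, routing)
        for path in candidates[flow_id]:
            trial = dict(routing)
            trial[flow_id] = path
            cost = max_link_load(flows, trial)
            if cost < best_cost:
                best_path, best_cost = path, cost
        routing[flow_id] = best_path
    return routing


# Hypothetical example: two equal flows from host "a" to host "b",
# each able to cross either switch "s1" or switch "s2".
flows = {"f1": 10.0, "f2": 10.0}
paths = [("a", "s1", "b"), ("a", "s2", "b")]
candidates = {"f1": paths, "f2": paths}
initial = {"f1": paths[0], "f2": paths[0]}  # both flows collide on s1

improved = reoptimize(flows, candidates, initial)
```

Because training traffic is described as regular and persistent, a pass like `reoptimize` could be rerun at intervals on the observed flows. In this toy case the initial routing loads one link with both flows, and the greedy pass moves one flow to the other switch.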

Code Implementations: 1 repo