AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism
This work addresses scalability issues faced by researchers and practitioners training large models, though the contribution is incremental in that it builds on existing parallelism strategies.
The paper tackles the communication bottleneck in distributed neural network training by introducing asynchronous updates for both data and pipeline parallelism. In experiments on models with up to 1B parameters, it matches the performance of synchronous baselines while reducing communication overhead.
Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their scalability. We address this communication bottleneck by introducing asynchronous updates across both parallelism axes, relaxing the co-location requirement at the expense of introducing staleness between pipeline stages and data parallel replicas. To mitigate staleness, for pipeline parallelism, we adopt a weight look-ahead approach, and for data parallelism, we introduce an asynchronous sparse averaging method equipped with an exponential moving average based correction mechanism. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models (up to 1B parameters) demonstrate that our approach matches the performance of the fully synchronous baseline, while significantly reducing communication overhead.
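The two staleness mitigations named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact update rules, so everything below is an assumption made for illustration: `look_ahead` extrapolates weights along a momentum buffer for a hypothetical `delay` of stale steps, and `sparse_average` averages a randomly chosen fraction of coordinates across replicas. The function names, the random coordinate selection, and the linear extrapolation are not taken from the paper, and the exponential moving average based correction is left out because the abstract does not specify its form.

```python
import random


def look_ahead(weights, velocity, lr, delay):
    """Extrapolate weights `delay` steps ahead along the momentum direction.

    Assumed rule (not from the paper): w_hat = w - lr * delay * v.
    """
    return [w - lr * delay * v for w, v in zip(weights, velocity)]


def sparse_average(replicas, fraction, rng):
    """Average a random subset of coordinates across replicas, in place.

    Only about `fraction` of the coordinates are communicated per round;
    the rest stay local. Returns the indices that were averaged.
    """
    dim = len(replicas[0])
    k = max(1, int(fraction * dim))
    indices = rng.sample(range(dim), k)
    for i in indices:
        mean = sum(r[i] for r in replicas) / len(replicas)
        for r in replicas:
            r[i] = mean
    return indices


if __name__ == "__main__":
    rng = random.Random(0)
    replicas = [[0.0, 2.0, 4.0, 6.0], [2.0, 4.0, 6.0, 8.0]]
    averaged = sparse_average(replicas, fraction=0.5, rng=rng)
    print("averaged coordinates:", sorted(averaged))
    print("replicas after one round:", replicas)
    print("look-ahead weights:", look_ahead([1.0, 2.0], [0.5, -1.0], lr=0.1, delay=2))
```

In this sketch, each round of `sparse_average` preserves the per-coordinate mean across replicas while touching only a fraction of the parameters, which is the sense in which sparse averaging would reduce communication volume.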