Biased Local SGD for Efficient Deep Learning on Heterogeneous Systems
This work addresses the underutilization of slower resources such as CPUs in heterogeneous systems and offers a practical solution for cloud platforms and HPC clusters, though the contribution is incremental because it builds on local SGD.
The paper tackles the problem of inefficient neural network training on heterogeneous computing systems by proposing a biased local SGD method that allocates workloads proportionally to compute capacity, achieving comparable or higher accuracy than synchronous SGD within the same time budget.
Most large-scale neural network training methods assume homogeneous parallel computing resources. For example, synchronous SGD with data parallelism, the most widely used parallel training strategy, incurs significant synchronization overhead when workers process their assigned data at different speeds. Consequently, in systems with heterogeneous compute resources, users often rely solely on the fastest components, such as GPUs, for training. In this work, we explore how to effectively use heterogeneous resources for neural network training. We propose a system-aware local stochastic gradient descent (local SGD) method that allocates workloads to each compute resource in proportion to its compute capacity. To make better use of slower resources such as CPUs, we intentionally introduce bias into data sampling and model aggregation. Our study shows that well-controlled bias can significantly accelerate local SGD in heterogeneous environments, achieving comparable or even higher accuracy than synchronous SGD with data parallelism within the same time budget. This fundamental parallelization strategy can be readily extended to diverse heterogeneous environments, including cloud platforms and multi-node high-performance computing clusters.
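The capacity-proportional allocation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `allocate_local_steps` and `weighted_average`, the use of measured throughput as the capacity estimate, and the choice of step counts as aggregation weights are all assumptions made for illustration. The abstract does not specify the exact form of the sampling or aggregation bias.

```python
from typing import List, Sequence


def allocate_local_steps(throughputs: Sequence[float], total_steps: int) -> List[int]:
    """Split total_steps across workers in proportion to throughput.

    Hypothetical helper: uses largest-remainder rounding so the
    integer step counts always sum to total_steps.
    """
    if total_steps < 0 or not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("need positive throughputs and non-negative total_steps")
    total = sum(throughputs)
    exact = [total_steps * t / total for t in throughputs]
    steps = [int(x) for x in exact]  # floor, since all values are non-negative
    leftover = total_steps - sum(steps)
    # Hand the remaining steps to the workers with the largest fractional parts.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - steps[i], reverse=True)
    for i in order[:leftover]:
        steps[i] += 1
    return steps


def weighted_average(models: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Average flat parameter vectors with non-uniform weights.

    Assumption: weighting each worker by its step count is one simple
    way to bias aggregation; the paper's scheme may differ.
    """
    total = sum(weights)
    dim = len(models[0])
    return [sum(w * m[j] for m, w in zip(models, weights)) / total for j in range(dim)]


# Example: a GPU measured at 4x the throughput of a CPU, 10 local steps per round.
steps = allocate_local_steps([4.0, 1.0], 10)
averaged = weighted_average([[1.0, 2.0], [3.0, 4.0]], steps)
```

With this split, both workers should finish their local steps at roughly the same wall-clock time, so the faster device does not sit idle waiting at the aggregation point.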