Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers
This work addresses efficiency challenges in extreme multi-label classification for researchers and practitioners who train on multi-GPU servers. Its contribution is incremental, since it builds on existing SGD and model averaging methods.
The paper tackles the problem of training deep learning models on sparse data in heterogeneous multi-GPU servers, where variance in the number of non-zero features across batches and GPU heterogeneity limit accuracy and increase convergence time. It shows experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy and scales with the number of GPUs.
Motivated by extreme multi-label classification applications, we consider training deep learning models over sparse data in multi-GPU servers. The variance in the number of non-zero features across training batches and the intrinsic GPU heterogeneity combine to limit accuracy and increase the time to convergence. We address these challenges with Adaptive SGD, an adaptive elastic model averaging stochastic gradient descent algorithm for heterogeneous multi-GPU servers that is characterized by dynamic scheduling, adaptive batch size scaling, and normalized model merging. Instead of statically partitioning batches across GPUs, dynamic scheduling routes batches based on each GPU's relative processing speed. Batch size scaling assigns larger batches to the faster GPUs and smaller batches to the slower ones, with the goal of reaching a steady state in which all the GPUs perform the same number of model updates. Normalized model merging computes optimal weights for every GPU based on the assigned batches such that the combined model achieves better accuracy. We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy and is scalable with the number of GPUs.
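The batch size scaling and model merging ideas can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so the rules below are assumptions chosen for illustration: batch sizes are split in proportion to a measured per-GPU throughput, and merge weights are proportional to the number of samples each GPU processed. The function names (`scale_batch_sizes`, `merge_weights`, `merge_models`) are hypothetical, and models are represented as flat lists of floats rather than real network parameters.

```python
from typing import List


def scale_batch_sizes(throughputs: List[float], total_batch: int) -> List[int]:
    """Split total_batch across GPUs in proportion to throughput (samples/s).

    Assumption: proportional scaling stands in for the paper's adaptive rule.
    """
    total = sum(throughputs)
    sizes = [int(total_batch * t / total) for t in throughputs]
    # Give any rounding remainder to the fastest GPU.
    fastest = max(range(len(throughputs)), key=lambda i: throughputs[i])
    sizes[fastest] += total_batch - sum(sizes)
    return sizes


def merge_weights(samples_per_gpu: List[int]) -> List[float]:
    """Normalize per-GPU sample counts so that the weights sum to 1.

    Assumption: weighting by sample count, not the paper's optimal weights.
    """
    total = sum(samples_per_gpu)
    return [s / total for s in samples_per_gpu]


def merge_models(replicas: List[List[float]], weights: List[float]) -> List[float]:
    """Combine model replicas as a weighted average, parameter by parameter."""
    n_params = len(replicas[0])
    return [
        sum(w * replica[j] for w, replica in zip(weights, replicas))
        for j in range(n_params)
    ]


# Hypothetical server: one GPU three times faster than the other.
sizes = scale_batch_sizes([300.0, 100.0], total_batch=256)
weights = merge_weights(sizes)
merged = merge_models([[1.0, 2.0], [3.0, 6.0]], weights)
```

Under this proportional rule, every GPU needs roughly the same wall-clock time per batch, so each performs about the same number of model updates, which is the steady state the abstract describes.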