Distributed Stochastic Optimization via Adaptive SGD
This work addresses the problem of scaling up training for machine learning models in distributed settings, offering a robust solution that reduces hyperparameter tuning needs.
The paper tackles the challenge of parallelizing stochastic gradient descent for large-scale machine learning by proposing a distributed optimization method that combines adaptivity with variance reduction, achieving a linear speedup in the number of machines, a constant memory footprint, and only a logarithmic number of communication rounds.
Stochastic convex optimization algorithms are the most popular way to train machine learning models on large-scale data. Scaling up the training process of these models is crucial, but the most popular algorithm, Stochastic Gradient Descent (SGD), is a serial method that is surprisingly hard to parallelize. In this paper, we propose an efficient distributed stochastic optimization method by combining adaptivity with variance reduction techniques. Our analysis yields a linear speedup in the number of machines, constant memory footprint, and only a logarithmic number of communication rounds. Critically, our approach is a black-box reduction that parallelizes any serial online learning algorithm, streamlining prior analysis and allowing us to leverage the significant progress that has been made in designing adaptive algorithms. In particular, we achieve optimal convergence rates without any prior knowledge of smoothness parameters, yielding a more robust algorithm that reduces the need for hyperparameter tuning. We implement our algorithm in the Spark distributed framework and exhibit dramatic performance gains on large-scale logistic regression problems.
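To make the idea of combining adaptivity with variance reduction concrete, the following is a minimal single-process sketch, not the paper's algorithm or its Spark implementation. It assumes an SVRG-style control variate for variance reduction and per-coordinate AdaGrad as the serial learner, and it simulates machines as plain Python lists of examples. All function names (`make_shards`, `shard_gradient`, `train`, and so on) and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def grad_one(w, x, y):
    # Gradient of the logistic loss log(1 + exp(-y * <w, x>)), with y in {-1, +1}.
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y * sigmoid(-margin)
    return [coeff * xi for xi in x]


def mean_loss(w, data):
    # Average logistic loss over a list of (x, y) examples, computed stably.
    total = 0.0
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(data)


def make_shards(w_true, num_machines=4, per_machine=50, seed=1):
    # Synthetic, linearly separable data split evenly across simulated machines.
    rng = random.Random(seed)
    shards = []
    for _ in range(num_machines):
        shard = []
        for _ in range(per_machine):
            x = [rng.gauss(0.0, 1.0) for _ in w_true]
            score = sum(wi * xi for wi, xi in zip(w_true, x))
            shard.append((x, 1 if score >= 0 else -1))
        shards.append(shard)
    return shards


def shard_gradient(w, shard):
    # Work done by one simulated machine: sum of local gradients at w.
    g = [0.0] * len(w)
    for x, y in shard:
        for j, gj in enumerate(grad_one(w, x, y)):
            g[j] += gj
    return g


def train(shards, dim, epochs=5, inner_steps=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    data = [ex for shard in shards for ex in shard]
    n = len(data)
    w = [0.0] * dim
    sq_sum = [0.0] * dim  # AdaGrad accumulator of squared gradients
    for _ in range(epochs):
        anchor = list(w)
        # One simulated communication round: average the shard gradients
        # at the anchor point (this part is embarrassingly parallel).
        partial = [shard_gradient(anchor, s) for s in shards]
        full = [sum(p[j] for p in partial) / n for j in range(dim)]
        # Serial inner loop: variance-reduced estimates fed to AdaGrad.
        for _ in range(inner_steps):
            x, y = data[rng.randrange(n)]
            g_w = grad_one(w, x, y)
            g_anchor = grad_one(anchor, x, y)
            g = [a - b + c for a, b, c in zip(g_w, g_anchor, full)]
            for j in range(dim):
                sq_sum[j] += g[j] * g[j]
                # Per-coordinate adaptive step; no smoothness constant is supplied.
                w[j] -= lr * g[j] / (math.sqrt(sq_sum[j]) + 1e-12)
    return w


if __name__ == "__main__":
    shards = make_shards([1.0, -2.0, 0.5])
    data = [ex for shard in shards for ex in shard]
    w = train(shards, dim=3)
    print("loss at zero:", mean_loss([0.0] * 3, data))
    print("loss after training:", mean_loss(w, data))
```

In this sketch the only step that touches every shard is the anchor gradient, computed once per epoch, while the inner loop is an ordinary serial learner. The AdaGrad update could in principle be replaced by another online learning rule without changing the outer structure, which is the sense in which the abstract describes the approach as a black-box reduction.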