Elastic CoCoA: Scaling In to Improve Convergence
The paper addresses resource efficiency for practitioners of distributed machine learning, though it is an incremental improvement over existing CoCoA methods.
The paper tackles inefficient resource allocation in distributed machine learning by showing that the optimal number of workers for CoCoA changes during training. It presents Chicle, an elastic framework that dynamically adjusts the number of workers and achieves up to 5.96x faster time-to-accuracy than the best static setting.
In this paper we experimentally analyze the convergence behavior of CoCoA and show that the number of workers required to achieve the highest convergence rate at any point in time changes over the course of training. Based on this observation, we build Chicle, an elastic framework that dynamically adjusts the number of workers based on feedback from the training algorithm in order to select the number of workers that yields the highest convergence rate. In our evaluation on 6 datasets, we show that Chicle accelerates the time-to-accuracy by a factor of up to 5.96x compared to the best static setting, while being robust enough to find an optimal or near-optimal setting automatically in most cases.
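The feedback loop described above can be illustrated with a minimal sketch. It is not Chicle's actual policy, which the abstract does not specify. It assumes a greedy rule that, each round, compares the current worker count against half of it and keeps whichever reports the higher convergence rate. The names `convergence_rate`, `scale_in_schedule`, and the `probe` callback are hypothetical and introduced only for illustration.

```python
from typing import Callable, List


def convergence_rate(gap_before: float, gap_after: float, seconds: float) -> float:
    """Relative reduction of the optimization gap per second of wall-clock time."""
    if gap_before <= 0 or seconds <= 0:
        raise ValueError("gap_before and seconds must be positive")
    return (gap_before - gap_after) / (gap_before * seconds)


def scale_in_schedule(
    start_workers: int,
    min_workers: int,
    rounds: int,
    probe: Callable[[int, int], float],
) -> List[int]:
    """Greedy scale-in under an assumed halving rule (not the paper's policy).

    probe(round_index, workers) is a hypothetical callback that returns the
    convergence rate measured when training with `workers` in that round.
    Returns the worker count chosen in each round.
    """
    workers = start_workers
    history: List[int] = []
    for r in range(rounds):
        # Candidate: half the current workers, but never below the minimum.
        smaller = max(min_workers, workers // 2)
        # Scale in only if the smaller setting converges strictly faster.
        if smaller != workers and probe(r, smaller) > probe(r, workers):
            workers = smaller
        history.append(workers)
    return history
```

In practice, `probe` would wrap a short training run and compute its result with something like `convergence_rate`; probing two settings per round has a cost that a real system would need to amortize.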