Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization
This work addresses the efficiency of large-scale machine learning optimization. It offers a new algorithm that lowers stochastic gradient complexity relative to prior methods, although the contribution is an incremental improvement on existing SVRG techniques.
The paper tackles the high computational cost of stochastic variance-reduced gradient (SVRG) algorithms by proposing a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly convex problems. The algorithm comes with improved complexity guarantees, including a bound that does not depend on data size. For quadratic loss at the generalization-level accuracy ε = O(1/sqrt(n)), it needs O(n^0.875 log^1.5(n)) stochastic gradient evaluations.
Stochastic variance-reduced gradient (SVRG) algorithms have been shown to work well for large-scale learning problems. Despite this success, the stochastic gradient complexity of SVRG-type algorithms usually scales linearly with data size and can therefore still be expensive for huge datasets. To address this deficiency, we propose a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly convex problems that enjoys provably improved data-size-independent complexity guarantees. More precisely, for a quadratic loss $F(\theta)$ of $n$ components, we prove that HSDMPG can attain an $\epsilon$-optimization-error $\mathbb{E}[F(\theta)-F(\theta^*)]\leq\epsilon$ within $\mathcal{O}\Big(\frac{\kappa^{1.5}\epsilon^{0.75}\log^{1.5}(\frac{1}{\epsilon})+1}{\epsilon}\wedge\Big(\kappa\sqrt{n}\log^{1.5}\big(\frac{1}{\epsilon}\big)+n\log\big(\frac{1}{\epsilon}\big)\Big)\Big)$ stochastic gradient evaluations, where $\kappa$ is the condition number and $\wedge$ denotes the minimum. For generic strongly convex loss functions, we prove a nearly identical complexity bound, at the cost of slightly increased logarithmic factors. For large-scale learning problems, our complexity bounds are superior to those of the prior state-of-the-art SVRG algorithms, with or without dependence on data size. In particular, in the case of $\epsilon=\mathcal{O}\big(1/\sqrt{n}\big)$, which is on the order of the intrinsic excess error bound of a learning model and is thus sufficient for generalization, the stochastic gradient complexity bounds of HSDMPG for quadratic and generic loss functions are respectively $\mathcal{O}(n^{0.875}\log^{1.5}(n))$ and $\mathcal{O}(n^{0.875}\log^{2.25}(n))$, which, to the best of our knowledge, achieve optimal generalization in less than a single pass over data for the first time. Extensive numerical results demonstrate the computational advantages of our algorithm over prior ones.
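As a rough check of how the $n^{0.875}$ rate can follow from the general quadratic-loss bound, the sketch below substitutes $\epsilon = n^{-1/2}$ into both branches of the minimum. It relies on an assumption the abstract does not state, namely that the condition number grows as $\kappa = \mathcal{O}(\sqrt{n})$; under a different scaling of $\kappa$ the exponents would change.

```latex
% Assumption (not stated in the abstract): \kappa = \mathcal{O}(\sqrt{n}).
% Substitution: \epsilon = n^{-1/2}, so \log(1/\epsilon) = \tfrac{1}{2}\log n.
\begin{align*}
\frac{\kappa^{1.5}\epsilon^{0.75}\log^{1.5}(1/\epsilon)+1}{\epsilon}
  &= \kappa^{1.5}\epsilon^{-0.25}\log^{1.5}(1/\epsilon) + \epsilon^{-1} \\
  &= \mathcal{O}\big(n^{0.75}\cdot n^{0.125}\log^{1.5}(n) + n^{0.5}\big) \\
  &= \mathcal{O}\big(n^{0.875}\log^{1.5}(n)\big), \\[4pt]
\kappa\sqrt{n}\log^{1.5}(1/\epsilon) + n\log(1/\epsilon)
  &= \mathcal{O}\big(n\log^{1.5}(n) + n\log(n)\big)
   = \mathcal{O}\big(n\log^{1.5}(n)\big).
\end{align*}
```

Under this assumption the first branch is the smaller one, so the minimum matches the stated $\mathcal{O}(n^{0.875}\log^{1.5}(n))$ bound, which grows more slowly than $n$ asymptotically.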