OC LG | Jun 7, 2024

Efficient Continual Finite-Sum Minimization

Ioannis Mavrothalassitis, Stratis Skoulakis, Leello Tadesse Dadi, Volkan Cevher

arXiv:2406.04731v1 3.2

Originality: Incremental advance

AI Analysis

This work studies how efficiently first-order optimization can keep up with data that arrives sequentially in machine learning. It gives a nearly tight complexity bound, though the advance is incremental relative to existing variance reduction techniques.

The paper tackles continual finite-sum minimization, where the i-th point of a sequence must minimize the i-th prefix-sum of a sequence of functions. It proposes a first-order method (CSVRG) that produces an ε-optimal sequence with Õ(n/ε^{1/3} + 1/√ε) first-order oracle calls, improving on Stochastic Gradient Descent and on variance reduction methods such as Katyusha.
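To make the problem statement concrete, here is a minimal sketch on a toy instance. The quadratic choice $f_j(x) = \tfrac{1}{2}\|x - a_j\|^2$, the function names, and the step size are assumptions made for illustration and are not taken from the paper. For this choice every prefix-sum minimizer is a running mean, so the naive baseline (warm-started full gradient descent on each prefix, which is not CSVRG) can be checked exactly while counting first-order oracle calls.

```python
import numpy as np

def prefix_minimizers(a):
    """Closed-form prefix minimizers for the toy choice f_j(x) = 0.5 * ||x - a_j||^2.

    The i-th prefix objective (1/i) * sum_{j<=i} f_j is minimized by the mean
    of a_1..a_i, so the whole sequence x_1*, ..., x_n* is a running mean.
    """
    a = np.asarray(a, dtype=float)
    counts = np.arange(1, len(a) + 1).reshape(-1, 1)
    return np.cumsum(a, axis=0) / counts

def naive_continual_gd(a, steps=50, lr=0.5):
    """Naive baseline (not CSVRG): warm-started full gradient descent per prefix.

    Counts one first-order oracle call (FO) for each single gradient
    grad f_j(x) = x - a_j that is evaluated.
    """
    a = np.asarray(a, dtype=float)
    x = np.zeros(a.shape[1])
    xs, fo_calls = [], 0
    for i in range(1, len(a) + 1):
        for _ in range(steps):
            grad = np.mean(x - a[:i], axis=0)  # uses i single-function gradients
            fo_calls += i
            x = x - lr * grad
        xs.append(x.copy())
    return np.array(xs), fo_calls

rng = np.random.default_rng(0)
a = rng.normal(size=(20, 3))          # n = 20 hypothetical data points in R^3
x_star = prefix_minimizers(a)
xs, fo_calls = naive_continual_gd(a)
print(np.max(np.abs(xs - x_star)), fo_calls)
```

Because stage $i$ touches $i$ gradients per step, the oracle count of this baseline grows quadratically in $n$, which is the cost that a continual method aims to avoid.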

Given a sequence of functions $f_1,\ldots,f_n$ with $f_i:\mathcal{D}\to \mathbb{R}$, finite-sum minimization seeks a point ${x}^\star \in \mathcal{D}$ minimizing $\sum_{j=1}^n f_j(x)/n$. In this work, we propose a key twist on finite-sum minimization, dubbed continual finite-sum minimization, that asks for a sequence of points ${x}_1^\star,\ldots,{x}_n^\star \in \mathcal{D}$ such that each ${x}^\star_i$ minimizes the prefix-sum $\sum_{j=1}^i f_j(x)/i$. Assuming that each prefix-sum is strongly convex, we develop a first-order continual stochastic variance reduction gradient method ($\mathrm{CSVRG}$) that produces an $\epsilon$-optimal sequence with $\tilde{\mathcal{O}}(n/\epsilon^{1/3} + 1/\sqrt{\epsilon})$ first-order oracle calls (FOs) overall. An FO corresponds to the computation of a single gradient $\nabla f_j(x)$ at a given $x \in \mathcal{D}$ for some $j \in [n]$. Our approach significantly improves upon the $\mathcal{O}(n/\epsilon)$ FOs that $\mathrm{StochasticGradientDescent}$ requires and the $\mathcal{O}(n^2 \log (1/\epsilon))$ FOs that state-of-the-art variance reduction methods such as $\mathrm{Katyusha}$ require. We also prove that there is no natural first-order method with $\mathcal{O}\left(n/\epsilon^{\alpha}\right)$ gradient complexity for $\alpha < 1/4$, establishing that the first-order complexity of our method is nearly tight.
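As a rough feel for how the three FO bounds quoted in the abstract compare, the sketch below evaluates their leading terms for given $n$ and $\epsilon$. It assumes all constants equal 1 and drops the logarithmic factors hidden by $\tilde{\mathcal{O}}$, so only the orders of magnitude are meaningful. The function name `fo_bounds` and the sample values of `n` and `eps` are hypothetical.

```python
import math

def fo_bounds(n, eps):
    """Leading terms of the FO bounds from the abstract.

    Assumption: constants are set to 1 and the polylog factors hidden by
    the O-tilde notation are ignored, so these are orders of magnitude only.
    """
    return {
        "sgd": n / eps,                                   # O(n / eps)
        "katyusha": n**2 * math.log(1 / eps),             # O(n^2 log(1/eps))
        "csvrg": n / eps ** (1 / 3) + 1 / math.sqrt(eps), # O~(n / eps^(1/3) + 1/sqrt(eps))
    }

bounds = fo_bounds(n=10_000, eps=1e-6)
for name, value in bounds.items():
    print(f"{name:>9}: {value:.3e}")
```

For $n = 10^4$ and $\epsilon = 10^{-6}$ the leading terms are about $10^{10}$ for SGD, about $1.4 \cdot 10^{9}$ for the Katyusha-type bound, and about $10^{6}$ for CSVRG.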
