ML Jun 14, 2022
A Stochastic Proximal Method for Nonsmooth Regularized Finite Sum Optimization
Dounia Lakhmiri, Dominique Orban, Andrea Lodi
We consider the problem of training a deep neural network with nonsmooth regularization to retrieve a sparse and efficient sub-structure. Our regularizer is only assumed to be lower semi-continuous and prox-bounded. We combine an adaptive quadratic regularization approach with proximal stochastic gradient principles to derive a new solver, called SR2, whose convergence and worst-case complexity are established without knowledge or approximation of the gradient's Lipschitz constant. We formulate a stopping criterion that ensures an appropriate first-order stationarity measure converges to zero under certain conditions. We establish a worst-case iteration complexity of $\mathcal{O}(\epsilon^{-2})$ that matches that of related methods like ProxGEN, where the learning rate is assumed to be related to the Lipschitz constant. Our experiments on network instances trained on CIFAR-10 and CIFAR-100 with $\ell_1$ and $\ell_0$ regularizations show that SR2 consistently achieves higher sparsity and accuracy than related methods such as ProxGEN and ProxSGD.
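The proximal step at the core of a quadratically regularized method with $\ell_1$ or $\ell_0$ regularization can be sketched as follows. This is only an illustration of the building block, not the SR2 solver itself: the adaptive update of the regularization parameter and the stochastic gradient estimate are omitted, and the names `prox_l1`, `prox_l0`, `regularized_prox_step`, `lam` and `sigma` are assumptions introduced here.

```python
import math


def prox_l1(v, lam, sigma):
    """Soft-thresholding: prox of (lam/sigma) * ||.||_1 evaluated at v."""
    t = lam / sigma
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]


def prox_l0(v, lam, sigma):
    """Hard-thresholding: prox of (lam/sigma) * ||.||_0 evaluated at v.

    A component is kept when keeping it (cost lam) is cheaper than zeroing
    it (cost sigma * vi**2 / 2), i.e. when |vi| > sqrt(2 * lam / sigma).
    """
    t = math.sqrt(2.0 * lam / sigma)
    return [vi if abs(vi) > t else 0.0 for vi in v]


def regularized_prox_step(x, g, sigma, prox, lam):
    """Minimize g^T s + (sigma/2) ||s||^2 + lam * h(x + s) over s.

    Returns the new point x + s. Here sigma > 0 plays the role of an
    inverse step size (hypothetical name, chosen for this sketch), and
    g stands for a gradient estimate supplied by the caller.
    """
    v = [xi - gi / sigma for xi, gi in zip(x, g)]
    return prox(v, lam, sigma)
```

With a larger `sigma` the step is shorter and the thresholds shrink, so fewer components are set to zero in a single step.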
83.3 NA Mar 30
A Spectral Preconditioner for the Conjugate Gradient Method with Iteration Budget
Youssef Diouane, Selime Gürol, Oussama Mouhtal et al.
We study the solution of large symmetric positive-definite linear systems in a matrix-free setting with a limited iteration budget. We focus on the preconditioned conjugate gradient (PCG) method with spectral preconditioning. Spectral preconditioners map a subset of eigenvalues to a positive cluster via a scaling parameter, and leave the remainder of the spectrum unchanged, in the hope of reducing the number of iterations to convergence. We formulate the design of the spectral preconditioners as a constrained optimization problem. The optimal cluster placement is defined to minimize the error in energy norm at a fixed iteration. This optimality criterion provides new insight into the design of efficient spectral preconditioners when PCG is stopped short of convergence. We propose practical strategies for selecting the scaling parameter, hence the cluster position, that incur negligible computational cost. Numerical experiments highlight the importance of cluster placement and demonstrate significant improvements in terms of error in energy norm, particularly during the initial iterations.
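A minimal matrix-free sketch of the idea follows, assuming the common form $P = I + \sum_i (\theta/\lambda_i - 1)\, u_i u_i^T$ built from known orthonormal eigenpairs $(\lambda_i, u_i)$. With this form, $PA$ has eigenvalue $\theta$ on each $u_i$ and the rest of the spectrum is untouched. The exact preconditioner and the rule for choosing $\theta$ in the paper may differ; `spectral_preconditioner`, `pcg` and `theta` are names introduced for this illustration.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def spectral_preconditioner(eigpairs, theta):
    """Return a function r -> P r with P = I + sum_i (theta/lam_i - 1) u_i u_i^T.

    eigpairs is a list of (lam_i, u_i) with orthonormal u_i (assumed exact).
    theta > 0 is the position of the cluster the lam_i are mapped to.
    """
    def apply(r):
        out = list(r)
        for lam, u in eigpairs:
            c = (theta / lam - 1.0) * dot(u, r)
            out = [oi + c * ui for oi, ui in zip(out, u)]
        return out
    return apply


def pcg(matvec, b, precond, max_iter):
    """Preconditioned conjugate gradient from x0 = 0, stopped after max_iter."""
    x = [0.0] * len(b)
    r = list(b)
    z = precond(r)
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if rz <= 1e-30:  # residual is numerically zero
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = precond(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```

Passing `precond=lambda r: list(r)` recovers plain CG, which makes it easy to compare the error after a fixed number of iterations for different values of `theta`.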
OC Sep 28, 2024
A Proximal Modified Quasi-Newton Method for Nonsmooth Regularized Optimization
Youssef Diouane, Mohamed Laghdaf Habiboullah, Dominique Orban
We develop R2N, a modified quasi-Newton method for minimizing the sum of a $\mathcal{C}^1$ function $f$ and a lower semi-continuous prox-bounded $h$. Both $f$ and $h$ may be nonconvex. At each iteration, our method computes a step by minimizing the sum of a quadratic model of $f$, a model of $h$, and an adaptive quadratic regularization term. A step may be computed by a variant of the proximal-gradient method. An advantage of R2N over trust-region (TR) methods is that proximal operators do not involve an extra TR indicator. We also develop the variant R2DH, in which the model Hessian is diagonal, which allows us to compute a step without relying on a subproblem solver when $h$ is separable. R2DH can be used as a standalone solver, but also as a subproblem solver inside R2N. We describe non-monotone variants of both R2N and R2DH. Global convergence of a first-order stationarity measure to zero holds without relying on local Lipschitz continuity of $\nabla f$, while allowing model Hessians to grow unbounded, an assumption particularly relevant to quasi-Newton models. Under Lipschitz continuity of $\nabla f$, we establish a tight worst-case complexity bound of $O(1 / \epsilon^{2/(1 - p)})$ to bring said measure below $\epsilon > 0$, where $0 \leq p < 1$ controls the growth of model Hessians. The latter must not diverge faster than $|\mathcal{S}_k|^p$, where $\mathcal{S}_k$ is the set of successful iterations up to iteration $k$. When $p = 1$, we establish the tight exponential complexity bound $O(\exp(c \epsilon^{-2}))$ where $c > 0$ is a constant. We describe our Julia implementation and report numerical experience on a classic basis-pursuit problem, an image denoising problem, a minimum-rank matrix completion problem, a nonlinear support vector machine and an inverse nonlinear problem.
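To illustrate why a diagonal model Hessian removes the need for a subproblem solver when $h$ is separable, here is a sketch for the particular case $h = \lambda \|\cdot\|_1$, with the model of $h$ taken as $h(x + s)$. Both choices are assumptions made for this example, as is the requirement that every $d_i + \sigma$ be positive. Under them, each component of the step has a closed form. The function name `diag_l1_step` is hypothetical and the authors' Julia implementation is not reproduced here.

```python
import math


def diag_l1_step(x, g, d, sigma, lam):
    """Componentwise minimizer of
        g_i * s_i + 0.5 * (d_i + sigma) * s_i**2 + lam * |x_i + s_i|.

    d holds the diagonal of the model Hessian and sigma is the quadratic
    regularization parameter. Each one-dimensional problem is solved by
    soft-thresholding with its own curvature c_i = d_i + sigma.
    """
    step = []
    for xi, gi, di in zip(x, g, d):
        c = di + sigma
        if c <= 0.0:
            raise ValueError("d_i + sigma must be positive in this sketch")
        v = xi - gi / c                       # unregularized minimizer in x_i + s_i
        u = math.copysign(max(abs(v) - lam / c, 0.0), v)
        step.append(u - xi)                   # convert back to a step
    return step
```

Because the curvature differs per coordinate, the threshold `lam / c` also differs per coordinate, which is what distinguishes this step from a plain proximal-gradient step with a single step size.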
LG Nov 29, 2021
Adaptive First- and Second-Order Algorithms for Large-Scale Machine Learning
Sanae Lotfi, Tiphaine Bonniot de Ruisselet, Dominique Orban et al.
In this paper, we consider both first- and second-order techniques to address continuous optimization problems arising in machine learning. In the first-order case, we propose a framework for transitioning from deterministic or semi-deterministic to stochastic quadratic regularization methods. We leverage the two-phase nature of stochastic optimization to propose a novel first-order algorithm with adaptive sampling and adaptive step size. In the second-order case, we propose a novel stochastic damped L-BFGS method that improves on previous algorithms in the highly nonconvex context of deep learning. Both algorithms are evaluated on well-known deep learning datasets and exhibit promising performance.
LG Dec 10, 2020
Stochastic Damped L-BFGS with Controlled Norm of the Hessian Approximation
Sanae Lotfi, Tiphaine Bonniot de Ruisselet, Dominique Orban et al.
We propose a new stochastic variance-reduced damped L-BFGS algorithm, where we leverage estimates of bounds on the largest and smallest eigenvalues of the Hessian approximation to balance its quality and conditioning. Our algorithm, VARCHEN, draws from previous work that proposed a novel stochastic damped L-BFGS algorithm called SdLBFGS. We establish almost sure convergence to a stationary point and a complexity bound. We empirically demonstrate that VARCHEN is more robust than SdLBFGS-VR and SVRG on a modified DavidNet problem, a highly nonconvex and ill-conditioned problem that arises in the context of deep learning, and that the three methods perform comparably on a logistic regression problem and a nonconvex support-vector machine problem.
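For readers unfamiliar with damping, the classical Powell rule for BFGS-type updates can be sketched as below. It is a stand-in, not the specific damping or eigenvalue control used by VARCHEN or SdLBFGS. The function name `powell_damped_y` and the default threshold `eta = 0.2` are assumptions of this sketch. The rule replaces the gradient difference $y$ by a convex combination of $y$ and $Bs$ so that the curvature $s^T \bar{y}$ stays positive even on nonconvex problems.

```python
def powell_damped_y(s, y, Bs, eta=0.2):
    """Return the damped vector y_bar = theta * y + (1 - theta) * B s.

    s  : step x_{k+1} - x_k
    y  : gradient difference along s
    Bs : product of the current Hessian approximation with s
         (s^T B s is assumed positive, i.e. B is positive definite)
    After damping, s^T y_bar >= eta * s^T B s > 0.
    """
    sy = sum(si * yi for si, yi in zip(s, y))
    sBs = sum(si * bi for si, bi in zip(s, Bs))
    if sy >= eta * sBs:
        theta = 1.0                      # enough curvature: keep y unchanged
    else:
        theta = (1.0 - eta) * sBs / (sBs - sy)
    return [theta * yi + (1.0 - theta) * bi for yi, bi in zip(y, Bs)]
```

When the observed curvature is already large enough, the pair `(s, y)` is used unchanged, so damping only intervenes on steps that would otherwise break positive definiteness of the update.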