MLMay 3, 2022
Convergence of Stochastic Approximation via Martingale and Converse Lyapunov MethodsM. Vidyasagar
In this paper, we study the almost sure boundedness and the convergence of the stochastic approximation (SA) algorithm. At present, most available convergence proofs are based on the ODE method, and the almost sure boundedness of the iterations is an assumption and not a conclusion. In Borkar-Meyn (2000), it is shown that if the ODE has only one globally attractive equilibrium, then under additional assumptions, the iterations are bounded almost surely, and the SA algorithm converges to the desired solution. Our objective in the present paper is to provide an alternate proof of the above, based on martingale methods, which are simpler and less technical than those based on the ODE method. As a prelude, we prove a new sufficient condition for the global asymptotic stability of an ODE. Next we prove a "converse" Lyapunov theorem on the existence of a suitable Lyapunov function with a globally bounded Hessian, for a globally exponentially stable system. Both theorems are of independent interest to researchers in stability theory. Then, using these results, we provide sufficient conditions for the almost sure boundedness and the convergence of the SA algorithm. We show through examples that our theory covers some situations that are not covered by currently known results, specifically Borkar-Meyn (2000).
MLDec 5, 2023
Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and ApplicationsRajeeva L. Karandikar, M. Vidyasagar
In this paper, we study the convergence properties of the Stochastic Gradient Descent (SGD) method for finding a stationary point of a given objective function $J(\cdot)$. The objective function is not required to be convex. Rather, our results apply to a class of ``invex'' functions, which have the property that every stationary point is also a global minimizer. First, it is assumed that $J(\cdot)$ satisfies a property that is slightly weaker than the Kurdyka-Lojasiewicz (KL) condition, denoted here as (KL'). It is shown that the iterations $J(\boldsymbolθ_t)$ converge almost surely to the global minimum of $J(\cdot)$. Next, the hypothesis on $J(\cdot)$ is strengthened from (KL') to the Polyak-Lojasiewicz (PL) condition. With this stronger hypothesis, we derive estimates on the rate of convergence of $J(\boldsymbolθ_t)$ to its limit. Using these results, we show that for functions satisfying the PL property, the convergence rate of both the objective function and the norm of the gradient with SGD is the same as the best-possible rate for convex functions. While some results along these lines have been published in the past, our contributions contain two distinct improvements. First, the assumptions on the stochastic gradient are more general than elsewhere, and second, our convergence is almost sure, and not in expectation. We also study SGD when only function evaluations are permitted. In this setting, we determine the ``optimal'' increments or the size of the perturbations. Using the same set of ideas, we establish the global convergence of the Stochastic Approximation (SA) algorithm under more general assumptions on the measurement error, compared to the existing literature. We also derive bounds on the rate of convergence of the SA algorithm under appropriate assumptions.
MLSep 8, 2021
Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement LearningRajeeva L. Karandikar, M. Vidyasagar
We begin by briefly surveying some results on the convergence of the Stochastic Gradient Descent (SGD) Method, proved in a companion paper by the present authors. These results are based on viewing SGD as a version of Stochastic Approximation (SA). Ever since its introduction in the classic paper of Robbins and Monro in 1951, SA has become a standard tool for finding a solution of an equation of the form $f(θ) = 0$, when only noisy measurements of $f(\cdot)$ are available. In most situations, \textit{every component} of the putative solution $θ_t$ is updated at each step $t$. In some applications in Reinforcement Learning (RL), \textit{only one component} of $θ_t$ is updated at each $t$. This is known as \textbf{asynchronous} SA. In this paper, we study \textbf{Block Asynchronous SA (BASA)}, in which, at each step $t$, \textit{some but not necessarily all} components of $θ_t$ are updated. The theory presented here embraces both conventional (synchronous) SA as well as asynchronous SA, and all in-between possibilities. We provide sufficient conditions for the convergence of BASA, and also prove bounds on the \textit{rate} of convergence of $θ_t$ to the solution. For the case of conventional SGD, these results reduce to those proved in our companion paper. Then we apply these results to the problem of finding a fixed point of a map with only noisy measurements. This problem arises frequently in RL. We prove sufficient conditions for convergence as well as estimates for the rate of convergence.
MLDec 29, 2015
Error Bounds for Compressed Sensing Algorithms With Group Sparsity: A Unified ApproachM. Eren Ahsen, M. Vidyasagar
In compressed sensing, in order to recover a sparse or nearly sparse vector from possibly noisy measurements, the most popular approach is $\ell_1$-norm minimization. Upper bounds for the $\ell_2$- norm of the error between the true and estimated vectors are given in [1] and reviewed in [2], while bounds for the $\ell_1$-norm are given in [3]. When the unknown vector is not conventionally sparse but is "group sparse" instead, a variety of alternatives to the $\ell_1$-norm have been proposed in the literature, including the group LASSO, sparse group LASSO, and group LASSO with tree structured overlapping groups. However, no error bounds are available for any of these modified objective functions. In the present paper, a unified approach is presented for deriving upper bounds on the error between the true vector and its approximation, based on the notion of decomposable and $γ$-decomposable norms. The bounds presented cover all of the norms mentioned above, and also provide a guideline for choosing norms in future to accommodate alternate forms of sparsity.