OCJul 7, 2020
Asymptotic behaviour of learning rates in Armijo's conditionTuyen Trung Truong, Tuan Hang Nguyen
Fix a constant $0<α<1$. For a $C^1$ function $f:\mathbb{R}^k\rightarrow \mathbb{R}$, a point $x$ and a positive number $δ>0$, we say that Armijo's condition is satisfied if $f(x-δ\nabla f(x))-f(x)\leq -αδ||\nabla f(x)||^2$. It is a basis for the well known Backtracking Gradient Descent (Backtracking GD) algorithm. Consider a sequence $\{x_n\}$ defined by $x_{n+1}=x_n-δ_n\nabla f(x_n)$, for positive numbers $δ_n$ for which Armijo's condition is satisfied. We show that if $\{x_n\}$ converges to a non-degenerate critical point, then $\{δ_n\}$ must be bounded. Moreover this boundedness can be quantified in terms of the norms of the Hessian $\nabla ^2f$ and its inverse at the limit point. This complements the first author's results on Unbounded Backtracking GD, and shows that in case of convergence to a non-degenerate critical point the behaviour of Unbounded Backtracking GD is not too different from that of usual Backtracking GD. On the other hand, in case of convergence to a degenerate critical point the behaviours can be very much different. We run some experiments to illustrate that both scenrios can really happen. In another part of the paper, we argue that Backtracking GD has the correct unit (according to a definition by Zeiler in his Adadelta's paper). The main point is that since learning rate in Backtracking GD is bound by Armijo's condition, it is not unitless.
OCJun 2, 2020
A fast and simple modification of Newton's method helping to avoid saddle pointsTuyen Trung Truong, Tat Dat To, Tuan Hang Nguyen et al.
We propose in this paper New Q-Newton's method. The update rule is very simple conceptually, for example $x_{n+1}=x_n-w_n$ where $w_n=pr_{A_n,+}(v_n)-pr_{A_n,-}(v_n)$, with $A_n=\nabla ^2f(x_n)+δ_n||\nabla f(x_n)||^2.Id$ and $v_n=A_n^{-1}.\nabla f(x_n)$. Here $δ_n$ is an appropriate real number so that $A_n$ is invertible, and $pr_{A_n,\pm}$ are projections to the vector subspaces generated by eigenvectors of positive (correspondingly negative) eigenvalues of $A_n$. The main result of this paper roughly says that if $f$ is $C^3$ (can be unbounded from below) and a sequence $\{x_n\}$, constructed by the New Q-Newton's method from a random initial point $x_0$, {\bf converges}, then the limit point is a critical point and is not a saddle point, and the convergence rate is the same as that of Newton's method. The first author has recently been successful incorporating Backtracking line search to New Q-Newton's method, thus resolving the convergence guarantee issue observed for some (non-smooth) cost functions. An application to quickly finding zeros of a univariate meromorphic function will be discussed. Various experiments are performed, against well known algorithms such as BFGS and Adaptive Cubic Regularization are presented.
OCAug 15, 2018
Backtracking gradient descent method for general $C^1$ functions, with applications to Deep LearningTuyen Trung Truong, Tuan Hang Nguyen
While Standard gradient descent is one very popular optimisation method, its convergence cannot be proven beyond the class of functions whose gradient is globally Lipschitz continuous. As such, it is not actually applicable to realistic applications such as Deep Neural Networks. In this paper, we prove that its backtracking variant behaves very nicely, in particular convergence can be shown for all Morse functions. The main theoretical result of this paper is as follows. Theorem. Let $f:\mathbb{R}^k\rightarrow \mathbb{R}$ be a $C^1$ function, and $\{z_n\}$ a sequence constructed from the Backtracking gradient descent algorithm. (1) Either $\lim _{n\rightarrow\infty}||z_n||=\infty$ or $\lim _{n\rightarrow\infty}||z_{n+1}-z_n||=0$. (2) Assume that $f$ has at most countably many critical points. Then either $\lim _{n\rightarrow\infty}||z_n||=\infty$ or $\{z_n\}$ converges to a critical point of $f$. (3) More generally, assume that all connected components of the set of critical points of $f$ are compact. Then either $\lim _{n\rightarrow\infty}||z_n||=\infty$ or $\{z_n\}$ is bounded. Moreover, in the latter case the set of cluster points of $\{z_n\}$ is connected. Some generalised versions of this result, including an inexact version, are included. Another result in this paper concerns the problem of saddle points. We then present a heuristic argument to explain why Standard gradient descent method works so well, and modifications of the backtracking versions of GD, MMT and NAG. Experiments with datasets CIFAR10 and CIFAR100 on various popular architectures verify the heuristic argument also for the mini-batch practice and show that our new algorithms, while automatically fine tuning learning rates, perform better than current state-of-the-art methods such as MMT, NAG, Adagrad, Adadelta, RMSProp, Adam and Adamax.