LG · Jul 28, 2023
How regularization affects the geometry of loss functions
Nathaniel Bottman, Y. Cooper, Antonio Lerario
What neural networks learn depends fundamentally on the geometry of the underlying loss function. We study how different regularizers affect the geometry of this function. One of the most basic geometric properties of a smooth function is whether it is Morse or not. For nonlinear deep neural networks, the unregularized loss function $L$ is typically not Morse. We consider several different regularizers, including weight decay, and study for which regularizers the regularized function $L_\epsilon$ becomes Morse.
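To make the Morse condition concrete, here is a minimal sketch on a toy loss that is not taken from the paper: $L(a, b) = (ab - 1)^2$, a stand-in for a two-parameter linear network, with weight decay $\epsilon(a^2 + b^2)$. The function names and the choice $\epsilon = 0.1$ are illustrative assumptions. A function is Morse when the Hessian is nondegenerate (nonzero determinant) at every critical point. Without regularization, every point of the curve $ab = 1$ is a critical point with a singular Hessian. With $0 < \epsilon < 1$, a hand calculation for this toy gives three isolated critical points, and the code checks that the Hessian is nonsingular at each of them.

```python
import numpy as np

def gradient(a, b, eps):
    """Gradient of the toy loss L_eps(a, b) = (a*b - 1)**2 + eps*(a**2 + b**2)."""
    r = a * b - 1.0
    return np.array([2.0 * b * r + 2.0 * eps * a,
                     2.0 * a * r + 2.0 * eps * b])

def hessian(a, b, eps):
    """Hessian of the same toy loss, computed by hand."""
    off_diag = 4.0 * a * b - 2.0
    return np.array([[2.0 * b**2 + 2.0 * eps, off_diag],
                     [off_diag, 2.0 * a**2 + 2.0 * eps]])

def critical_points(eps):
    """Critical points of the toy loss for 0 < eps < 1 (hand-derived, toy only)."""
    s = np.sqrt(1.0 - eps)
    return [(0.0, 0.0), (s, s), (-s, -s)]

# Unregularized case: sample critical points on the curve a*b = 1.
# The Hessian determinant vanishes there, so L is not Morse.
unreg_dets = [np.linalg.det(hessian(a, 1.0 / a, 0.0)) for a in (0.5, 1.0, 2.0)]

# Weight decay with an assumed eps = 0.1: the critical points become isolated
# and the Hessian determinant is nonzero at each of them.
eps = 0.1
reg_dets = [np.linalg.det(hessian(a, b, eps)) for a, b in critical_points(eps)]

print("unregularized determinants:", unreg_dets)
print("regularized determinants:  ", reg_dets)
```

For this toy, the determinant at the two nonzero critical points works out to $16\epsilon(1 - \epsilon)$ and at the origin to $4\epsilon^2 - 4$, so the regularized function is Morse for every $0 < \epsilon < 1$. This says nothing about which regularizers work for actual deep networks, which is the question the paper addresses.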
LG · May 8, 2020
The critical locus of overparameterized neural networks
Y. Cooper
Many aspects of the geometry of loss functions in deep learning remain mysterious. In this paper, we work toward a better understanding of the geometry of the loss function $L$ of overparameterized feedforward neural networks. In this setting, we identify several components of the critical locus of $L$ and study their geometric properties. For networks of depth $\ell \geq 4$, we identify a locus of critical points we call the star locus $S$. Within $S$ we identify a positive-dimensional sublocus $C$ with the property that for $p \in C$, $p$ is a degenerate critical point, and no existing theoretical result guarantees that gradient descent will not converge to $p$. For very wide networks, we build on earlier work and show that all critical points of $L$ are degenerate, and give lower bounds on the number of zero eigenvalues of the Hessian at each critical point. For networks that are both deep and very wide, we compare the growth rates of the zero eigenspaces of the Hessian at all the different families of critical points that we identify. The results in this paper provide a starting point for a more quantitative understanding of the properties of various components of the critical locus of $L$.
OC · Sep 14, 2018
Gradient descent in higher codimension
Y. Cooper
We consider the behavior of gradient flow and of discrete and noisy gradient descent. It is commonly noted that the addition of noise to the process of discrete gradient descent can affect the trajectory of gradient descent. In previous work, we observed such effects. There, we considered the case where the minima had codimension 1. In this note, we do some computer experiments and observe the behavior of noisy gradient descent in the more complex setting of minima of higher codimension.
OC · Aug 14, 2018
Gradient descent in some simple settings
Y. Cooper
In this note, we observe the behavior of gradient flow and discrete and noisy gradient descent in some simple settings. It is commonly noted that addition of noise to gradient descent can affect the trajectory of gradient descent. Here, we run some computer experiments for gradient descent on some simple functions, and observe this principle in some concrete examples.
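The kind of experiment described in the two gradient descent notes can be sketched as follows. This is an illustrative setup, not the one used in either note: the function $f(x, y) = (xy - 1)^2$, whose minima form the curve $xy = 1$ (codimension 1 in the plane), the step size, the noise scale `sigma`, and the starting point are all assumptions chosen for the example. The code runs discrete gradient descent once without noise and once with Gaussian noise added to each gradient, so the two trajectories can be compared.

```python
import numpy as np

def loss(p):
    """Toy function f(x, y) = (x*y - 1)**2; its minima lie on the curve x*y = 1."""
    x, y = p
    return (x * y - 1.0) ** 2

def grad(p):
    """Gradient of the toy function."""
    x, y = p
    r = x * y - 1.0
    return np.array([2.0 * r * y, 2.0 * r * x])

def noisy_gradient_descent(start, lr=0.01, sigma=0.0, steps=2000, seed=0):
    """Discrete gradient descent with Gaussian noise of scale sigma added to each gradient.

    Returns the whole trajectory as an array of shape (steps + 1, 2).
    """
    rng = np.random.default_rng(seed)
    path = np.empty((steps + 1, 2))
    path[0] = start
    for t in range(steps):
        noise = sigma * rng.standard_normal(2)
        path[t + 1] = path[t] - lr * (grad(path[t]) + noise)
    return path

# Same start and step size; only the noise scale differs between the two runs.
clean = noisy_gradient_descent((2.0, 1.0), sigma=0.0)
noisy = noisy_gradient_descent((2.0, 1.0), sigma=0.5)

print("noiseless endpoint:", clean[-1], "loss:", loss(clean[-1]))
print("noisy endpoint:    ", noisy[-1], "loss:", loss(noisy[-1]))
```

In the noiseless run the iterates settle onto a single point of the curve of minima. In the noisy run the iterates stay near that curve but continue to move along it, so the endpoint depends on the noise. Plotting `clean` and `noisy` side by side is one way to see the difference.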