Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data
This work studies the implicit biases of gradient-based optimization in deep learning and gives theoretical insight into why low-rank solutions arise in high-dimensional settings. The contribution is incremental in that it builds on prior work on the implicit bias of homogeneous networks.
The paper investigates the implicit bias of gradient flow and gradient descent in two-layer leaky ReLU networks trained on high-dimensional, nearly-orthogonal data. It shows that gradient flow asymptotically produces a network of rank at most two whose linear decision boundary corresponds to an approximate-max-margin linear predictor, and that a single step of gradient descent from a sufficiently small random initialization drastically reduces the rank of the network.
The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two. Moreover, this network is an $\ell_2$-max-margin solution (in parameter space), and has a linear decision boundary that corresponds to an approximate-max-margin linear predictor. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training. We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent.
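The single-step rank reduction described above can be illustrated numerically. The following is a minimal sketch under assumptions that are not taken from the abstract: isotropic Gaussian inputs with $d \gg n$ as a stand-in for nearly-orthogonal data, random $\pm 1$ labels, the logistic loss, a fixed second layer with entries $\pm 1/\sqrt{m}$ so that only the first-layer weights are trained, and the stable rank $\|W\|_F^2 / \|W\|_2^2$ as the notion of rank. All function names, dimensions, and the step size are hypothetical choices made for illustration.

```python
import numpy as np


def leaky_relu(z, gamma):
    # Leaky ReLU with negative-side slope gamma.
    return np.where(z > 0, z, gamma * z)


def leaky_relu_grad(z, gamma):
    # Derivative of the leaky ReLU: 1 on the positive side, gamma otherwise.
    return np.where(z > 0, 1.0, gamma)


def stable_rank(W):
    # Stable rank ||W||_F^2 / ||W||_2^2, computed from the singular values.
    s = np.linalg.svd(W, compute_uv=False)
    return float(np.sum(s ** 2) / s[0] ** 2)


def gd_step(W, a, X, y, lr, gamma):
    # One full-batch gradient descent step on the first-layer weights W (m x d)
    # for the empirical logistic loss of f(x) = sum_j a_j * phi(w_j . x).
    n = X.shape[0]
    Z = X @ W.T                              # pre-activations, shape (n, m)
    f = leaky_relu(Z, gamma) @ a             # network outputs, shape (n,)
    dloss = -1.0 / (1.0 + np.exp(y * f))     # derivative of log(1 + exp(-q)) at q = y * f
    coeff = (dloss * y)[:, None] * leaky_relu_grad(Z, gamma) * a[None, :]
    grad = coeff.T @ X / n                   # gradient w.r.t. W, shape (m, d)
    return W - lr * grad


def one_step_stable_ranks(init_scale, n=20, d=2000, m=100, lr=1.0, gamma=0.1, seed=0):
    # Hypothetical experiment: returns the stable rank of W before and after
    # a single gradient descent step from a Gaussian initialization.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d)) / np.sqrt(d)      # rows have norm close to 1
    y = rng.choice([-1.0, 1.0], size=n)               # random labels (assumption)
    a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed second layer (assumption)
    W0 = init_scale * rng.standard_normal((m, d))
    W1 = gd_step(W0, a, X, y, lr, gamma)
    return stable_rank(W0), stable_rank(W1)


if __name__ == "__main__":
    for scale in (1e-6, 1.0):
        before, after = one_step_stable_ranks(scale)
        print(f"init scale {scale:g}: stable rank {before:.1f} -> {after:.1f}")
```

In this sketch, the small initialization lets the first gradient dominate the initial weights, so the stable rank should collapse to a small constant after one step, whereas with an initialization scale of 1 the step barely changes the weights and the stable rank should stay close to its initial value. This is only a toy illustration consistent with the claims above, not a reproduction of the paper's experiments.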