On the Principle of Least Symmetry Breaking in Shallow ReLU Models
This paper addresses optimization challenges in neural network training, though its contribution appears incremental because it builds on known symmetry-breaking concepts.
The paper investigates the structure of spurious local minima that arise when training two-layer ReLU networks with the squared loss. For standard Gaussian inputs, it shows that the minima found by stochastic gradient descent (SGD) exhibit the least loss of symmetry relative to the target weights. It then reports experiments suggesting that the same principle holds in broader settings, including non-isotropic non-product distributions, smooth activation functions, and networks with a few layers.
We consider the optimization problem associated with fitting two-layer ReLU networks with respect to the squared loss, where labels are assumed to be generated by a target network. Focusing first on standard Gaussian inputs, we show that the structure of spurious local minima detected by stochastic gradient descent (SGD) is, in a well-defined sense, the \emph{least loss of symmetry} with respect to the target weights. A closer look at the analysis indicates that this principle of least symmetry breaking may apply to a broader range of settings. Motivated by this, we conduct a series of experiments which corroborate this hypothesis for different classes of non-isotropic non-product distributions, smooth activation functions and networks with a few layers.
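The setting described in the abstract can be made concrete with a minimal sketch. It assumes details the abstract does not state: second-layer weights fixed to one, so that the student computes the sum over i of relu(w_i . x) and the target computes the sum over j of relu(v_j . x), and a population loss approximated by fresh standard Gaussian mini-batches. The function names (`batch_loss`, `batch_grad`, `sgd`) and all hyperparameters are hypothetical and chosen only for illustration; this is not the authors' code.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def batch_loss(W, V, X):
    """Average of 0.5 * (student(x) - target(x))^2 over the rows of X.

    W: (k, d) student weights, V: (m, d) target weights, X: (n, d) inputs.
    Assumption: second-layer weights are fixed to one.
    """
    residual = relu(X @ W.T).sum(axis=1) - relu(X @ V.T).sum(axis=1)
    return 0.5 * np.mean(residual ** 2)

def batch_grad(W, V, X):
    """Gradient of batch_loss with respect to W, shape (k, d)."""
    pre = X @ W.T                                  # (n, k) pre-activations
    residual = relu(pre).sum(axis=1) - relu(X @ V.T).sum(axis=1)
    active = (pre > 0).astype(X.dtype)             # ReLU derivative, 0 at 0
    return (active * residual[:, None]).T @ X / X.shape[0]

def sgd(W0, V, steps=2000, batch=256, lr=0.05, seed=0):
    """Plain SGD, drawing a fresh standard Gaussian batch at every step."""
    rng = np.random.default_rng(seed)
    W = W0.copy()
    d = W.shape[1]
    for _ in range(steps):
        X = rng.standard_normal((batch, d))
        W -= lr * batch_grad(W, V, X)
    return W

if __name__ == "__main__":
    d = k = 6
    V = np.eye(d)                                  # assumed target weights
    rng = np.random.default_rng(1)
    W0 = rng.standard_normal((k, d)) / np.sqrt(d)  # random initialization
    W = sgd(W0, V)
    X_eval = rng.standard_normal((10000, d))
    print("estimated loss at the point SGD reached:", batch_loss(W, V, X_eval))
```

Under these assumptions, one way to probe the symmetry question is to compare the point SGD reaches against row and column permutations of the target `V`: a point that is left unchanged by a large group of such permutations has lost little of the target's symmetry. How symmetry is measured precisely is defined in the paper, not in this sketch.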