It's not a Lottery, it's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task
This work addresses a foundational problem in neural network training dynamics and is aimed at researchers in machine learning theory. It is incremental in that it builds on existing concepts such as the lottery ticket hypothesis.
The paper investigates how gradient descent reduces the theoretical capacity of a neural network to an effective capacity that fits the task. By analyzing the learning dynamics of single-hidden-layer ReLU networks, it identifies three dynamical principles (mutual alignment, unlocking, and racing) that explain phenomena such as neuron merging and pruning, and it elucidates the mechanism behind the lottery ticket conjecture.
Our theoretical understanding of neural networks is lagging behind their empirical success. One of the important unexplained phenomena is why and how, during training with gradient descent, the theoretical capacity of neural networks is reduced to an effective capacity that fits the task. Here we investigate the mechanism by which gradient descent achieves this, by analyzing the learning dynamics at the level of individual neurons in single-hidden-layer ReLU networks. We identify three dynamical principles (mutual alignment, unlocking and racing) that together explain why we can often successfully reduce capacity after training through the merging of equivalent neurons or the pruning of low-norm weights. We specifically explain the mechanism behind the lottery ticket conjecture, that is, why the specific, beneficial initial conditions of some neurons lead them to obtain higher weight norms.
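The two post-training capacity reductions named in the abstract, merging equivalent neurons and pruning low-norm weights, can be illustrated with a small sketch. It is not the paper's analysis or code, and it does not reproduce the training dynamics. It assumes a bias-free network f(x) = sum_i a_i ReLU(w_i . x), reads "equivalent" as "input weight vectors pointing in the same direction", and measures a neuron's norm as |a_i| * ||w_i||. The function names, the tolerance, the threshold and the example weights are all hypothetical choices made for illustration.

```python
import math


def relu(z):
    return z if z > 0.0 else 0.0


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def norm(w):
    return math.sqrt(dot(w, w))


def forward(W, a, x):
    # f(x) = sum_i a[i] * relu(W[i] . x); no biases (assumption of this sketch).
    return sum(ai * relu(dot(w, x)) for w, ai in zip(W, a))


def merge_aligned(W, a, tol=1e-9):
    # Merge neurons whose input weights are positive multiples of each other.
    # If w = c * u with c > 0, then a * relu(w . x) = (a * c) * relu(u . x),
    # so the merge leaves the network function unchanged.
    merged_W, merged_a = [], []
    for w, ai in zip(W, a):
        n = norm(w)
        if n == 0.0:
            continue  # relu(0) = 0, so this neuron contributes nothing
        for k, u in enumerate(merged_W):
            nu = norm(u)
            if dot(w, u) / (n * nu) > 1.0 - tol:  # cosine similarity close to 1
                merged_a[k] += ai * n / nu
                break
        else:
            merged_W.append(list(w))
            merged_a.append(ai)
    return merged_W, merged_a


def prune_low_norm(W, a, threshold):
    # Drop neurons with |a_i| * ||w_i|| below the threshold. By Cauchy-Schwarz,
    # each dropped neuron changes f(x) by at most |a_i| * ||w_i|| * ||x||.
    kept = [(w, ai) for w, ai in zip(W, a) if abs(ai) * norm(w) >= threshold]
    return [w for w, _ in kept], [ai for _, ai in kept]


# Hypothetical "trained" weights: neurons 0 and 1 are aligned, neuron 3 is tiny.
W = [[1.0, 2.0], [2.0, 4.0], [-1.0, 0.5], [0.001, -0.001]]
a = [0.5, 0.25, -1.0, 0.01]

W_m, a_m = merge_aligned(W, a)
W_p, a_p = prune_low_norm(W_m, a_m, threshold=1e-3)

x = [0.3, 0.7]
print(len(W), len(W_m), len(W_p))
print(forward(W, a, x), forward(W_m, a_m, x), forward(W_p, a_p, x))
```

Under these assumptions, merging is exact while pruning is only approximate, with an error bounded by the norms of the removed neurons. This is why both operations are only safe when training has actually produced aligned or low-norm neurons, which is the outcome the three dynamical principles are meant to explain.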