Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win
This addresses the problem of inefficient training in sparse neural networks for researchers and practitioners in machine learning, offering incremental insights into gradient flow dynamics.
The paper investigates why training unstructured sparse neural networks from random initialization leads to poor generalization, and why Lottery Tickets and Dynamic Sparse Training are exceptions. It shows that sparse networks have poor gradient flow at initialization, highlights the importance of sparsity-aware initialization, and finds that Lottery Tickets succeed by re-learning pruning solutions rather than improving gradient flow.
Sparse Neural Networks (NNs) can match the generalization of dense NNs using a fraction of the compute/storage for inference, and also have the potential to enable efficient training. However, naively training unstructured sparse NNs from random initialization results in significantly worse generalization, with the notable exceptions of Lottery Tickets (LTs) and Dynamic Sparse Training (DST). Through our analysis of gradient flow during training we attempt to answer: (1) why training unstructured sparse networks from random initialization performs poorly and; (2) what makes LTs and DST the exceptions? We show that sparse NNs have poor gradient flow at initialization and demonstrate the importance of using sparsity-aware initialization. Furthermore, we find that DST methods significantly improve gradient flow during training over traditional sparse training methods. Finally, we show that LTs do not improve gradient flow, rather their success lies in re-learning the pruning solution they are derived from - however, this comes at the cost of learning novel solutions.