Structured Sparsity Inducing Adaptive Optimizers for Deep Learning
This work provides an incremental improvement to structured sparsity methods for deep learning practitioners, enabling more efficient pruning of neural networks.
The paper tackles the problem of pruning unimportant groups of parameters in neural networks by introducing non-differentiable penalties to the objective function. It derives weighted proximal operators for two structured sparsity-inducing penalties and proves that existing convergence guarantees are preserved even with efficient numerical approximations of these operators. The proposed adaptive method successfully finds solutions with structured sparsity patterns in computer vision and natural language processing tasks.
The parameters of a neural network are naturally organized in groups, some of which might not contribute to its overall performance. To prune out unimportant groups of parameters, we can include some non-differentiable penalty to the objective function, and minimize it using proximal gradient methods. In this paper, we derive the weighted proximal operator, which is a necessary component of these proximal methods, of two structured sparsity inducing penalties. Moreover, they can be approximated efficiently with a numerical solver, and despite this approximation, we prove that existing convergence guarantees are preserved when these operators are integrated as part of a generic adaptive proximal method. Finally, we show that this adaptive method, together with the weighted proximal operators derived here, is indeed capable of finding solutions with structure in their sparsity patterns, on representative examples from computer vision and natural language processing.