spred: Solving $L_1$ Penalty with SGD
This work addresses the difficulty of using $L_1$ penalties to obtain sparsity in deep learning and connects sparsity in deep learning with conventional statistical learning. The contribution is incremental in that it generalizes prior ideas.
The paper tackles the minimization of differentiable objectives under an $L_1$ constraint by proposing spred, a method that combines a reparametrization with stochastic gradient descent. It demonstrates the method on training sparse neural networks for gene selection and on neural network compression, a task on which previous attempts to apply the $L_1$ penalty were unsuccessful.
We propose to minimize a generic differentiable objective with an $L_1$ constraint using a simple reparametrization and straightforward stochastic gradient descent. Our proposal is a direct generalization of previous ideas that the $L_1$ penalty may be equivalent to a differentiable reparametrization with weight decay. We prove that the proposed method, \textit{spred}, is an exact differentiable solver of $L_1$ and that the reparametrization trick is completely ``benign'' for a generic nonconvex function. Practically, we demonstrate the usefulness of the method in (1) training sparse neural networks to perform gene selection tasks, which involve finding relevant features in a very high dimensional space, and (2) neural network compression, a task to which previous attempts at applying the $L_1$ penalty have been unsuccessful. Conceptually, our result bridges the gap between sparsity in deep learning and conventional statistical learning.
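
The equivalence between an $L_1$ penalty and a reparametrization with weight decay can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a one-dimensional quadratic loss $\frac{1}{2}(w - b)^2$, writes $w = uv$, and runs plain full-batch gradient descent (not SGD) with weight decay $\frac{\lambda}{2}(u^2 + v^2)$. Because $u^2 + v^2 \ge 2|uv|$ with equality when $|u| = |v|$, the minimizer in $w$ should coincide with that of $\frac{1}{2}(w - b)^2 + \lambda |w|$, which is the soft-thresholding of $b$. The function names, step size, iteration count, and initial values are all illustrative assumptions.

```python
def soft_threshold(b, lam):
    """Closed-form minimizer of 0.5*(w - b)**2 + lam*abs(w)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def solve_reparametrized(b, lam, lr=0.05, steps=5000, u0=1.0, v0=0.5):
    """Gradient descent on 0.5*(u*v - b)**2 + 0.5*lam*(u**2 + v**2).

    Returns w = u*v. The asymmetric start (u0 != v0) is an illustrative
    choice so that the product u*v can end up with either sign.
    """
    u, v = u0, v0
    for _ in range(steps):
        residual = u * v - b             # derivative of the loss w.r.t. w at w = u*v
        grad_u = residual * v + lam * u  # chain rule plus weight decay on u
        grad_v = residual * u + lam * v  # chain rule plus weight decay on v
        u, v = u - lr * grad_u, v - lr * grad_v
    return u * v


# Hypothetical targets: two above the threshold in magnitude, three at or below it.
targets = [3.0, -2.0, 0.5, -0.3, 0.0]
lam = 1.0

w_reparam = [solve_reparametrized(b, lam) for b in targets]
w_exact = [soft_threshold(b, lam) for b in targets]

for b, w_r, w_e in zip(targets, w_reparam, w_exact):
    print(f"b={b:+.1f}  reparametrized={w_r:+.6f}  soft-threshold={w_e:+.6f}")
```

In this toy setting the coordinates with $|b| \le \lambda$ are driven to zero by ordinary gradient steps on a smooth objective, which is the behavior an $L_1$ penalty is meant to produce.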