Decoupled Weight Decay for Any $p$ Norm
This work addresses the computational and storage costs of deep learning by offering practitioners an incremental improvement in sparsification methods.
The paper tackles the computational and storage bottlenecks in training large neural networks by introducing a novel weight decay scheme that generalizes standard $L_2$ weight decay to any $p$ norm, leading to highly sparse networks while maintaining generalization performance comparable to standard $L_2$ regularization.
With the success of deep neural networks (NNs) in a variety of domains, the computational and storage requirements for training and deploying large NNs have become a bottleneck for further improvements. Sparsification has consequently emerged as a leading approach to tackle these issues. In this work, we consider a simple yet effective approach to sparsification, based on Bridge, or $L_p$, regularization during training. We introduce a novel weight decay scheme, which generalizes the standard $L_2$ weight decay to any $p$ norm. We show that this scheme is compatible with adaptive optimizers, and avoids the gradient divergence associated with $0<p<1$ norms. We empirically demonstrate that it leads to highly sparse networks, while maintaining generalization performance comparable to standard $L_2$ regularization.
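The gradient divergence mentioned above can be illustrated with a minimal sketch. It is not the scheme proposed in this work; it only shows why adding a plain $|w|^p$ penalty to the loss is problematic for $0<p<1$, next to the familiar decoupled $L_2$ weight decay step that the proposed scheme generalizes. The function names `lp_penalty_grad` and `decoupled_l2_step` and all numerical values are hypothetical, chosen for illustration.

```python
import math


def lp_penalty_grad(w: float, p: float) -> float:
    """Derivative of the penalty |w|**p with respect to w, for w != 0."""
    return p * abs(w) ** (p - 1.0) * math.copysign(1.0, w)


def decoupled_l2_step(w: float, grad_loss: float, lr: float, weight_decay: float) -> float:
    """One SGD step with decoupled L2 weight decay (the p = 2 baseline).

    The decay term shrinks w directly instead of entering through the loss gradient.
    """
    return w - lr * grad_loss - lr * weight_decay * w


weights = (1e-2, 1e-4, 1e-8)

# For p = 2 the penalty gradient is 2 * w, so it vanishes as the weight shrinks.
grads_p2 = [lp_penalty_grad(w, 2.0) for w in weights]

# For p = 0.5 the penalty gradient scales like |w|**(-0.5), so it grows
# without bound as the weight approaches zero.
grads_p_half = [lp_penalty_grad(w, 0.5) for w in weights]

for w, g2, gh in zip(weights, grads_p2, grads_p_half):
    print(f"w={w:.0e}  grad(p=2)={g2:.3e}  grad(p=0.5)={gh:.3e}")
```

In this sketch the $p=2$ gradient decays together with the weight, whereas the $p=0.5$ gradient increases by a factor of ten for every hundredfold reduction of the weight. A naive gradient step on the penalty would therefore become arbitrarily large close to zero, which is the behavior the abstract says the proposed decay scheme avoids.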