LG | Jun 20, 2023

InRank: Incremental Low-Rank Learning

Jiawei Zhao, Yifei Zhang, Beidi Chen, Florian Schäfer, Anima Anandkumar

arXiv:2306.11250v2 | 16.012 citations | h-index: 78 | Has Code

Originality: Highly original

AI Analysis

This work addresses computational inefficiency in training large neural networks like transformers by leveraging low-rank properties, offering practical speed and size improvements.

The paper addresses the gap between greedy low-rank learning (GLRL) theory and practical training. It proves that cumulative weight updates follow an incremental low-rank trajectory for arbitrary orthogonal initialization in a three-layer linear network, and it introduces InRank, a training algorithm that explicitly represents those updates as low-rank matrices. On GPT-2, InRank achieves performance comparable to full-rank training while using at most 33% of the total ranks. An efficient variant reduces training time by 37% and model size by 36% when training GPT-medium on WikiText-103 from scratch.

The theory of greedy low-rank learning (GLRL) aims to explain the impressive generalization capabilities of deep learning. It proves that stochastic gradient-based training implicitly regularizes neural networks towards low-rank solutions through a gradual increase of the rank during training. However, there is a gap between theory and practice: GLRL requires an infinitesimal initialization of the weights, which is impractical because such an initialization is a saddle point. In this work, we remove the assumption of infinitesimal initialization by focusing on cumulative weight updates. We prove that the cumulative weight updates follow an incremental low-rank trajectory for arbitrary orthogonal initialization of weights in a three-layer linear network. Empirically, we demonstrate that our theory holds on a broad range of neural networks (e.g., transformers) and standard training algorithms (e.g., SGD, Adam). However, existing training algorithms do not exploit the low-rank property to improve computational efficiency, because the networks are not parameterized in low-rank form. To remedy this, we design a new training algorithm, Incremental Low-Rank Learning (InRank), which explicitly expresses cumulative weight updates as low-rank matrices while incrementally augmenting their ranks during training. We evaluate InRank on GPT-2, and our results indicate that InRank achieves prediction performance comparable to its full-rank counterpart while requiring at most 33% of the total ranks throughout training. We also propose an efficient version of InRank that achieves a reduction of 37% in total training time and 36% in model size when training GPT-medium on WikiText-103 from scratch.
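
The core idea in the abstract, expressing the cumulative weight update as a low-rank product and growing its rank during training, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the zero initialization of the new rows, and the explained-ratio rule for picking a rank are all assumptions made for illustration.

```python
import numpy as np


def init_low_rank(m, n, rank, rng, scale=1e-2):
    """Create factors U (m x rank) and V (rank x n) for the update D = U @ V.

    Assumption: V starts at zero so that D is exactly zero at step 0,
    while U is small and random so gradients with respect to V are nonzero.
    """
    U = scale * rng.standard_normal((m, rank))
    V = np.zeros((rank, n))
    return U, V


def effective_weight(W0, U, V):
    """Weight used in the forward pass: fixed initialization plus low-rank update."""
    return W0 + U @ V


def grow_rank(U, V, extra, rng, scale=1e-2):
    """Append `extra` rank-one directions without changing the product U @ V.

    New columns of U are small random values; new rows of V are zero,
    so the effective weight is identical before and after growth.
    """
    m, n = U.shape[0], V.shape[1]
    U_new = np.hstack([U, scale * rng.standard_normal((m, extra))])
    V_new = np.vstack([V, np.zeros((extra, n))])
    return U_new, V_new


def rank_for_explained_ratio(D, threshold=0.9):
    """Smallest k whose top-k singular values carry `threshold` of their sum.

    Hypothetical criterion for deciding how much rank an update needs.
    """
    s = np.linalg.svd(D, compute_uv=False)  # sorted in descending order
    total = s.sum()
    if total == 0.0:
        return 0
    ratio = np.cumsum(s) / total
    k = int(np.searchsorted(ratio, threshold)) + 1
    return min(k, len(s))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n = 64, 48
    W0 = rng.standard_normal((m, n))

    U, V = init_low_rank(m, n, rank=4, rng=rng)
    V = rng.standard_normal(V.shape)  # stand-in for a few training steps
    before = effective_weight(W0, U, V)

    U, V = grow_rank(U, V, extra=4, rng=rng)
    after = effective_weight(W0, U, V)

    print("weight unchanged by growth:", np.allclose(before, after))
    print("factor parameters:", U.size + V.size, "vs full update:", m * n)
```

In this sketch the trainable factors hold `rank * (m + n)` values instead of `m * n`, which is where savings in compute and model size would come from when the rank stays small.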
