LG · Jun 24, 2024

Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis

arXiv:2406.17167v1 · 11 citations
Originality: Incremental advance
AI Analysis

The paper provides foundational insights for efficient training and inference in large foundation models, though the advance is incremental because the analysis covers only a simplified one-layer case.

The paper tackles the theoretical understanding of why low-rank adaptation and pruning work for Transformers by analyzing a one-layer model, showing that trained parameters are low-rank and depend on label-relevant patterns, and that proper pruning has minimal impact on generalization.

Efficient training and inference algorithms, such as low-rank adaptation and model pruning, have shown impressive performance for learning Transformer-based large foundation models. However, due to the technical challenges of the non-convex optimization caused by the complicated architecture of Transformers, the theoretical study of why these methods can be applied to learn Transformers is mostly elusive. To the best of our knowledge, this paper shows the first theoretical analysis of the low-rank and sparsity properties of one-layer Transformers by characterizing the trained model after convergence using stochastic gradient descent. By focusing on a data model based on label-relevant and label-irrelevant patterns, we quantify that the gradient updates of trainable parameters are low-rank, with a rank that depends on the number of label-relevant patterns. We also analyze how model pruning affects generalization while improving computational efficiency and conclude that proper magnitude-based pruning has only a slight effect on testing performance. We run numerical experiments to support our findings.
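The two properties named in the abstract can be illustrated with a toy sketch. This is not the paper's construction or proof. It assumes, purely for illustration, that a parameter update is a sum of outer products with one term per pattern, and that pruning is entry-wise with a fixed sparsity fraction. The paper's actual pruning criterion may differ, for example by operating on whole neurons. The function names `outer_sum_update`, `numerical_rank`, and `magnitude_prune` are hypothetical helpers, not taken from the paper.

```python
def outer_sum_update(lefts, rights):
    """Return sum_k lefts[k] * rights[k]^T as a list-of-lists matrix.

    Assumption for illustration: each (left, right) pair stands in for the
    contribution of one pattern to a parameter update.
    """
    m, n = len(lefts[0]), len(rights[0])
    update = [[0.0] * n for _ in range(m)]
    for a, b in zip(lefts, rights):
        for i in range(m):
            for j in range(n):
                update[i][j] += a[i] * b[j]
    return update


def numerical_rank(matrix, tol=1e-9):
    """Rank of a small matrix via Gaussian elimination with partial pivoting."""
    rows = [list(r) for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) < tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest |value|.

    Assumption: unstructured, entry-wise pruning; the input is not modified.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    entries = sorted(
        (abs(w), i, j) for i, row in enumerate(weights) for j, w in enumerate(row)
    )
    k = int(sparsity * len(entries))  # number of entries to remove
    pruned = [list(row) for row in weights]
    for _, i, j in entries[:k]:
        pruned[i][j] = 0.0
    return pruned


if __name__ == "__main__":
    # Two "patterns" give an update of rank at most two, whatever its size.
    update = outer_sum_update(
        lefts=[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
        rights=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    )
    print("rank of update:", numerical_rank(update))
    print("pruned:", magnitude_prune([[0.5, -0.1], [0.05, -2.0]], sparsity=0.5))
```

In this toy setting the rank of the update is bounded by the number of outer-product terms rather than by the matrix dimensions, which mirrors the abstract's statement that the rank depends on the number of label-relevant patterns. The pruning helper keeps the largest-magnitude entries, the ones a magnitude-based criterion treats as most important.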

