LG ST ML
Nov 23, 2024

Gradient dynamics for low-rank fine-tuning beyond kernels

arXiv:2411.15385v16.42 citations h-index: 4

Originality: Incremental advance

AI Analysis

The paper provides theoretical insights into low-rank fine-tuning for foundation models, addressing a gap in mathematical understanding, though it is incremental in that it builds on existing methods like LoRA.

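The paper tackles the problem of understanding the learning mechanisms behind low-rank fine-tuning, specifically LoRA, by analyzing gradient dynamics in a student-teacher setting where the teacher is a rank-1 perturbation of a two-layer base model. It proves that a student initialized at the base model and trained with online gradient descent converges to the teacher in $dk^{O(1)}$ iterations, where $k$ is the number of neurons in the base model. Unlike in GLM regression, this complexity does not depend on fine-grained properties of the activation's Hermite expansion.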

LoRA has emerged as one of the de facto methods for fine-tuning foundation models with low computational cost and memory footprint. The idea is to only train a low-rank perturbation to the weights of a pre-trained model, given supervised data for a downstream task. Despite its empirical success, from a mathematical perspective it remains poorly understood what learning mechanisms ensure that gradient descent converges to useful low-rank perturbations. In this work we study low-rank fine-tuning in a student-teacher setting. We are given the weights of a two-layer base model $f$, as well as i.i.d. samples $(x,f^*(x))$ where $x$ is Gaussian and $f^*$ is the teacher model given by perturbing the weights of $f$ by a rank-1 matrix. This generalizes the setting of generalized linear model (GLM) regression, where the weights of $f$ are zero. When the rank-1 perturbation is comparable in norm to the weight matrix of $f$, the training dynamics are nonlinear. Nevertheless, in this regime we prove under mild assumptions that a student model which is initialized at the base model and trained with online gradient descent will converge to the teacher in $dk^{O(1)}$ iterations, where $k$ is the number of neurons in $f$. Importantly, unlike in the GLM setting, the complexity does not depend on fine-grained properties of the activation's Hermite expansion. We also prove that in our setting, learning the teacher model "from scratch" can require significantly more iterations.
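
The student-teacher setup described in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the paper's exact parameterization or training procedure: the model form $f(x) = \lambda^\top \sigma(Wx)$, the tanh activation, the LoRA-style initialization (one factor zero, the other random), the squared loss, and the names `two_layer`, `make_problem`, and `online_sgd` along with all dimensions and step sizes are hypothetical choices made for the example.

```python
import numpy as np


def two_layer(W, lam, x):
    """Two-layer model f(x) = lam^T tanh(W x) for one input x of shape (d,).

    The tanh activation is an assumption made for this sketch.
    """
    return lam @ np.tanh(W @ x)


def make_problem(d=20, k=5, seed=0):
    """Sample base weights (W, lam) and a rank-1 teacher perturbation c u^T."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((k, d)) / np.sqrt(d)
    lam = rng.standard_normal(k) / np.sqrt(k)
    c = rng.standard_normal(k)
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    return W, lam, c, u


def online_sgd(W, lam, c_star, u_star, steps=2000, lr=0.01, seed=1):
    """Online SGD on a rank-1 student W + b a^T, one fresh Gaussian sample per step.

    b starts at zero, so the student starts exactly at the base model
    (a LoRA-style initialization, assumed here).
    """
    rng = np.random.default_rng(seed)
    k, d = W.shape
    b = np.zeros(k)
    a = rng.standard_normal(d) / np.sqrt(d)
    W_star = W + np.outer(c_star, u_star)  # teacher weights
    losses = np.empty(steps)
    for t in range(steps):
        x = rng.standard_normal(d)          # fresh sample x ~ N(0, I_d)
        y = two_layer(W_star, lam, x)       # teacher label f*(x)
        h = np.tanh((W + np.outer(b, a)) @ x)
        err = lam @ h - y
        losses[t] = 0.5 * err**2
        # Gradient of 0.5 * err^2 w.r.t. the pre-activations (W + b a^T) x.
        g_pre = err * lam * (1.0 - h**2)
        grad_b = g_pre * (a @ x)            # d pre_i / d b_i = a . x
        grad_a = (g_pre @ b) * x            # d pre_i / d a_j = b_i x_j
        b -= lr * grad_b
        a -= lr * grad_a
    return b, a, losses
```

Calling `online_sgd(*make_problem())` returns the learned factors and the per-sample loss trace. Whether and how fast the trace decreases depends on the assumed step size and activation; the sketch makes no claim about reproducing the $dk^{O(1)}$ rate.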
