LG | STML | Nov 23, 2024

Gradient dynamics for low-rank fine-tuning beyond kernels

arXiv:2411.15385v1 | 2 citations | h-index: 4
Originality: Incremental advance
AI Analysis

This paper provides theoretical insight into low-rank fine-tuning of foundation models, addressing a gap in mathematical understanding, though it is incremental because it builds on existing methods such as LoRA.

The paper studies the learning mechanisms behind low-rank fine-tuning, specifically LoRA, by analyzing gradient dynamics in a student-teacher setting where the teacher is a rank-1 perturbation of a two-layer base model. It proves that online gradient descent converges to the teacher model in $dk^{O(1)}$ iterations, where $k$ is the number of neurons in the base model. Unlike in GLM regression, this complexity does not depend on fine-grained properties of the activation's Hermite expansion.

LoRA has emerged as one of the de facto methods for fine-tuning foundation models with low computational cost and memory footprint. The idea is to only train a low-rank perturbation to the weights of a pre-trained model, given supervised data for a downstream task. Despite its empirical success, from a mathematical perspective it remains poorly understood what learning mechanisms ensure that gradient descent converges to useful low-rank perturbations. In this work we study low-rank fine-tuning in a student-teacher setting. We are given the weights of a two-layer base model $f$, as well as i.i.d. samples $(x,f^*(x))$ where $x$ is Gaussian and $f^*$ is the teacher model given by perturbing the weights of $f$ by a rank-1 matrix. This generalizes the setting of generalized linear model (GLM) regression where the weights of $f$ are zero. When the rank-1 perturbation is comparable in norm to the weight matrix of $f$, the training dynamics are nonlinear. Nevertheless, in this regime we prove under mild assumptions that a student model which is initialized at the base model and trained with online gradient descent will converge to the teacher in $dk^{O(1)}$ iterations, where $k$ is the number of neurons in $f$. Importantly, unlike in the GLM setting, the complexity does not depend on fine-grained properties of the activation's Hermite expansion. We also prove that in our setting, learning the teacher model "from scratch" can require significantly more iterations.
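
To make the student-teacher setup concrete, here is a minimal NumPy sketch of online gradient descent on a rank-1 perturbation. It is an illustration under stated assumptions, not the paper's exact parameterization or algorithm: the base model is taken to be $f(x) = \sum_i a_i \sigma(\langle w_i, x\rangle)$ with fixed second-layer weights `a`, the teacher's first-layer weights are $W + c\,u_*^\top$, the vector `c` is assumed known so that only the direction `u` is trained, the activation is `tanh`, and the step size and step count are arbitrary. All function and variable names are hypothetical.

```python
import numpy as np


def act(z):
    # Activation (assumption: tanh, chosen only because it is smooth).
    return np.tanh(z)


def act_prime(z):
    # Derivative of tanh.
    return 1.0 - np.tanh(z) ** 2


def forward(W, a, c, u, x):
    """Two-layer net whose first-layer weights are W + c u^T.

    Computes sum_i a_i * act(<w_i, x> + c_i * <u, x>).
    """
    pre = W @ x + c * (u @ x)
    return a @ act(pre)


def loss_and_grad(W, a, c, u, x, y):
    """Squared loss 0.5 * (f_u(x) - y)^2 and its gradient with respect to u."""
    pre = W @ x + c * (u @ x)
    resid = a @ act(pre) - y
    # Chain rule: d(pre_i)/du = c_i * x, so df/du = sum_i a_i act'(pre_i) c_i * x.
    grad = resid * np.sum(a * act_prime(pre) * c) * x
    return 0.5 * resid ** 2, grad


def online_gd(W, a, c, u_star, steps=2000, lr=0.05, seed=0):
    """Online gradient descent: one fresh Gaussian sample per step."""
    rng = np.random.default_rng(seed)
    d = W.shape[1]
    u = np.zeros(d)  # zero perturbation: the student starts at the base model
    for _ in range(steps):
        x = rng.standard_normal(d)        # x ~ N(0, I_d)
        y = forward(W, a, c, u_star, x)   # label from the teacher f^*
        _, g = loss_and_grad(W, a, c, u, x, y)
        u -= lr * g
    return u
```

A call such as `online_gd(W, a, c, u_star)` returns the student's estimate of the perturbation direction, which can be compared with `u_star`. Setting `W` to zero recovers a GLM-style regression problem, the special case mentioned in the abstract. Whether and how quickly this simplified sketch converges depends on the assumed activation, scaling, and step size; the paper's guarantee applies to its own setting and assumptions.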
