LG | CL | Feb 19, 2025

On the Duality between Gradient Transformations and Adapters

arXiv:2502.13811v2 | 2 citations | h-index: 7 | ICML
Originality: Synthesis-oriented
AI Analysis

This work provides a theoretical framework for memory-efficient training. Its contribution is incremental: it connects existing methods rather than introducing a new paradigm.

The paper tackles the problem of memory-efficient neural network optimization by showing that linear gradient transformations are equivalent to reparameterizing models with linear adapters, unifying approaches like GaLore and LoRA.

We study memory-efficient optimization of neural networks (in particular language models) with linear gradient transformations, where the gradients are linearly mapped to a lower dimensional space than the full parameter space, thus saving memory required for gradient accumulation and optimizer state persistence. The model parameters are updated by first performing an optimization step in the lower dimensional space and then going back into the original parameter space via the linear map's transpose. We show that optimizing the model in this transformed space is equivalent to reparameterizing the original model through a linear adapter that additively modifies the model parameters, and then only optimizing the adapter's parameters. When the transformation is Kronecker-factored, this establishes an equivalence between GaLore and one-sided LoRA. We show that this duality between gradient transformations and adapter-based reparameterizations unifies existing approaches to memory-efficient training and suggests new techniques for improving training efficiency and memory use.
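The equivalence described in the abstract can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's code: the quadratic loss, the random map `S`, the momentum optimizer, and every name in it are hypothetical choices made for the example. It trains the same parameter vector twice, once by projecting gradients with `S` and mapping the low-dimensional step back through `S.T`, and once by optimizing only an additive linear adapter `theta0 + S.T @ phi`, then compares the results.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 12, 3            # full and reduced dimensions (illustrative sizes)
steps, lr, beta = 20, 1e-3, 0.9

# Toy least-squares loss L(theta) = 0.5 * ||A @ theta - b||^2 (assumption for the demo).
A = rng.standard_normal((8, n))
b = rng.standard_normal(8)

def grad(theta):
    """Gradient of the toy loss with respect to the full parameter vector."""
    return A.T @ (A @ theta - b)

# Fixed linear map from the n-dim parameter space to the r-dim space.
S = rng.standard_normal((r, n)) / np.sqrt(n)
theta0 = rng.standard_normal(n)

def train_gradient_transform():
    """Project the gradient with S, step in the low-dim space, map back with S.T."""
    theta = theta0.copy()
    m = np.zeros(r)                      # optimizer state lives in r dimensions
    for _ in range(steps):
        g_low = S @ grad(theta)          # transformed gradient
        m = beta * m + g_low             # momentum update in the low-dim space
        theta = theta + S.T @ (-lr * m)  # return to parameter space via the transpose
    return theta

def train_adapter():
    """Freeze theta0 and optimize only phi in theta = theta0 + S.T @ phi."""
    phi = np.zeros(r)                    # adapter starts at zero
    m = np.zeros(r)
    for _ in range(steps):
        theta = theta0 + S.T @ phi       # additive linear adapter
        g_phi = S @ grad(theta)          # chain rule: dL/dphi = S @ dL/dtheta
        m = beta * m + g_phi
        phi = phi - lr * m
    return theta0 + S.T @ phi

theta_gt = train_gradient_transform()
theta_ad = train_adapter()
print("max abs difference:", float(np.max(np.abs(theta_gt - theta_ad))))
```

Both loops keep optimizer state only in the `r`-dimensional space, which is where the memory saving comes from. Here `S` acts on a flat parameter vector; the Kronecker-factored case mentioned in the abstract would instead apply a projection to one side of a weight matrix.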

