LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics
LDAdam addresses the memory bottleneck faced by researchers and practitioners training large-scale AI models, though it is an incremental improvement over existing adaptive optimization methods.
The paper tackles the memory cost of training large models by introducing LDAdam, an optimizer that performs adaptive optimization in low-dimensional subspaces. This reduces the optimizer's memory footprint to a fraction of the model size while maintaining accuracy in fine-tuning and pre-training of language models.
We introduce LDAdam, a memory-efficient optimizer for training large models that performs adaptive optimization steps within lower-dimensional subspaces while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., for estimating the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models. Code is available at https://github.com/IST-DASLab/LDAdam
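
The three ingredients named in the abstract (low-rank projection, a state update that survives a change of subspace, and error feedback) can be illustrated with a small NumPy sketch. This is not the authors' implementation; the repository linked above is the reference. Everything here is an assumption made for illustration: the function names `init_state` and `low_rank_adam_step` are hypothetical, the subspace is chosen with an exact SVD, the second-moment transfer uses a simple squared-rotation heuristic rather than the paper's projection-aware rule, and the error buffer covers only gradient compression and is stored at full size for clarity.

```python
import numpy as np


def init_state(shape):
    """Create optimizer state for one weight matrix (hypothetical helper)."""
    return {"P": None, "m": None, "v": None, "err": np.zeros(shape), "t": 0}


def low_rank_adam_step(W, G, state, rank=2, lr=1e-2,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative Adam-style step in a rank-`rank` subspace.

    Hypothetical sketch, not the LDAdam reference code.
    """
    # Error feedback: add back what earlier projections discarded.
    A = G + state["err"]

    # ASSUMPTION: an exact SVD selects the subspace; this choice is
    # made here only to keep the sketch short.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    P = U[:, :rank]            # (rows, rank), orthonormal columns
    g = P.T @ A                # projected gradient, (rank, cols)
    state["err"] = A - P @ g   # residual lost by the projection

    if state["P"] is None:
        m_prev = np.zeros_like(g)
        v_prev = np.zeros_like(g)
    else:
        # Carry the moments from the old subspace into the new one.
        R = P.T @ state["P"]               # (rank, rank) change of basis
        m_prev = R @ state["m"]
        # ASSUMPTION: squared-rotation heuristic; it keeps v non-negative
        # but is a simplification, not the paper's update rule.
        v_prev = (R ** 2) @ state["v"]

    state["t"] += 1
    t = state["t"]
    m = beta1 * m_prev + (1 - beta1) * g
    v = beta2 * v_prev + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)           # Adam bias correction
    v_hat = v / (1 - beta2 ** t)

    state.update(P=P, m=m, v=v)
    # Map the low-dimensional update back to the full parameter space.
    return W - lr * (P @ (m_hat / (np.sqrt(v_hat) + eps)))


# Usage on a toy quadratic, 0.5 * ||W - target||^2, whose gradient is W - target.
rng = np.random.default_rng(0)
target = rng.standard_normal((8, 6))
W = np.zeros((8, 6))
state = init_state(W.shape)
for _ in range(100):
    W = low_rank_adam_step(W, W - target, state)
```

In this sketch the moment estimates `m` and `v` have shape `(rank, cols)` instead of `(rows, cols)`, which is where the memory saving in the optimizer state comes from. The full-size `err` buffer is kept only to make the error feedback step explicit and says nothing about how the actual implementation stores it.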