Riemannian Gradient Descent for Low-Rank Architectures
For practitioners working with low-rank deep learning architectures, this work provides a systematic comparison of Riemannian geometries for rank-factored parameters, but it shows no clear advantage over an AdamW baseline.
The paper explores Riemannian optimization for low-rank matrix parameters in deep learning, but after learning-rate tuning, the methods do not conclusively outperform the AdamW baseline.
We explore Riemannian optimization techniques for rank-factored matrix parameters, targeting contemporary deep learning applications. We examine ten points in the algorithm design space: two geometries for rank-$r$ matrices, three geometries for rank-$r$ partial isometries, and block-matrix variants of these five, where factors are shared across block-rows and block-columns. We apply our methods to the multihead attention parameters in small language models. After tuning learning rates, our methods do not conclusively outperform an AdamW baseline. Our implementations are available online.
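To make the setting concrete, the following is a minimal sketch of Riemannian gradient descent on a rank-$r$ partial isometry $W = U V^\top$, where $U$ and $V$ both have orthonormal columns. It is an illustration under stated assumptions, not the paper's implementation: the product-of-Stiefel geometry with the embedded metric, the QR retraction, the toy least-squares loss, and all function names (`project_tangent`, `qr_retract`, `rgd_step`) are choices made here, and this geometry may or may not coincide with any of the three the abstract mentions.

```python
import numpy as np

def sym(a):
    """Symmetric part of a square matrix."""
    return 0.5 * (a + a.T)

def project_tangent(x, z):
    """Project z onto the tangent space at x of the Stiefel manifold
    (matrices with orthonormal columns), using the embedded metric."""
    return z - x @ sym(x.T @ z)

def qr_retract(y):
    """Map y back to a matrix with orthonormal columns via the Q factor
    of a reduced QR decomposition, with signs fixed so diag(R) > 0."""
    q, r = np.linalg.qr(y)
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs

def loss(u, v, target):
    """Toy objective (an assumption for illustration): 0.5 * ||U V^T - T||_F^2."""
    return 0.5 * np.linalg.norm(u @ v.T - target) ** 2

def rgd_step(u, v, target, lr):
    """One Riemannian gradient step on each factor of W = U V^T."""
    residual = u @ v.T - target                   # Euclidean gradient w.r.t. W
    grad_u = project_tangent(u, residual @ v)     # Riemannian gradient for U
    grad_v = project_tangent(v, residual.T @ u)   # Riemannian gradient for V
    return qr_retract(u - lr * grad_u), qr_retract(v - lr * grad_v)

rng = np.random.default_rng(0)
m, n, r = 8, 6, 2

# Random orthonormal factors and a random rank-r partial isometry as target.
u = qr_retract(rng.standard_normal((m, r)))
v = qr_retract(rng.standard_normal((n, r)))
target = (qr_retract(rng.standard_normal((m, r)))
          @ qr_retract(rng.standard_normal((n, r))).T)

initial_loss = loss(u, v, target)
for _ in range(200):
    u, v = rgd_step(u, v, target, lr=0.1)
final_loss = loss(u, v, target)
```

The retraction keeps both factors exactly on the constraint set after every step, so $W = U V^\top$ remains a partial isometry throughout training. This is the property that distinguishes the approach from running an unconstrained optimizer such as AdamW directly on the factors.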