LG · AI · Feb 2, 2023

Mnemosyne: Learning to Train Transformers with Transformers

arXiv:2302.01128v3 · 14 citations · h-index: 51
AI Analysis

This paper addresses the need for efficient, automated optimizer tuning in machine learning, particularly for large-scale models. The contribution is incremental, since it builds on existing learnable-optimizer concepts.

The paper tackles the problem of automating optimizer design for neural networks by introducing Mnemosyne, a learnable optimizer based on spatio-temporal low-rank implicit attention Transformers, which outperforms LSTM optimizers, matches SOTA hand-designed optimizers in accuracy, and scales efficiently to large models like ViT-Hs and T5XXL.

In this work, we propose a new class of learnable optimizers, called Mnemosyne. It is based on the novel spatio-temporal low-rank implicit attention Transformers that can learn to train entire neural network architectures, including other Transformers, without any task-specific optimizer tuning. We show that Mnemosyne: (a) outperforms popular LSTM optimizers (also with new feature engineering to mitigate catastrophic forgetting of LSTMs), (b) can successfully train Transformers while using simple meta-training strategies that require minimal computational resources, (c) matches accuracy-wise SOTA hand-designed optimizers with carefully tuned hyper-parameters (often producing top performing models). Furthermore, Mnemosyne provides space complexity comparable to that of its hand-designed first-order counterparts, which allows it to scale to training larger sets of parameters. We conduct an extensive empirical evaluation of Mnemosyne on: (a) fine-tuning a wide range of Vision Transformers (ViTs) from medium-size architectures to massive ViT-Hs (36 layers, 16 heads), (b) pre-training BERT models and (c) soft prompt-tuning large 11B+ T5XXL models. We complement our results with a comprehensive theoretical analysis of the compact associative memory used by Mnemosyne which we believe was never done before.
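To make the phrase "compact associative memory" in the abstract more concrete, the following is a minimal sketch of one common way to build a fixed-size memory from low-rank (random-feature) attention. It is an illustration under stated assumptions, not the paper's implementation: the class name `CompactMemory`, the positive random-feature map, the exponential `discount`, and all sizes are hypothetical choices made here. The point it shows is that the stored state has constant size no matter how many steps are written, which is the kind of property that keeps memory cost comparable to first-order optimizers.

```python
import numpy as np


def positive_random_features(x, W):
    """Map x of shape (d,) to m positive features (assumed feature map).

    Dot products of these features approximate exp(q . k) in expectation.
    """
    proj = W @ x  # shape (m,)
    return np.exp(proj - 0.5 * np.dot(x, x)) / np.sqrt(W.shape[0])


class CompactMemory:
    """Hypothetical fixed-size associative memory using linear attention."""

    def __init__(self, d_key, d_value, n_features=64, discount=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_features, d_key))
        self.discount = discount                  # assumed exponential forgetting
        self.N = np.zeros((n_features, d_value))  # running sum of phi(k) v^T
        self.z = np.zeros(n_features)             # running sum of phi(k)

    def write(self, key, value):
        """Add one (key, value) pair; the state size does not grow."""
        phi = positive_random_features(key, self.W)
        self.N = self.discount * self.N + np.outer(phi, value)
        self.z = self.discount * self.z + phi

    def read(self, query):
        """Return an attention-weighted average of the stored values."""
        phi = positive_random_features(query, self.W)
        return (phi @ self.N) / (phi @ self.z + 1e-9)


# Usage: write a stream of stand-in "gradient" vectors, then query the memory.
mem = CompactMemory(d_key=4, d_value=4)
for t in range(100):
    g = np.sin(np.arange(4) + t)  # placeholder for a per-step gradient feature
    mem.write(g, g)
summary = mem.read(g)  # shape (4,), a weighted average of past values
```

Because the features are positive, each read is a weighted average of the values written so far, with older entries down-weighted by the discount. How the actual optimizer encodes gradients into keys, values, and queries is not specified in this abstract and is left out of the sketch.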

