A multilevel approach to accelerate the training of Transformers
This work targets slow training, a practical concern for researchers and practitioners who use Transformers, but it appears incremental because it builds on existing ODE interpretations of these architectures.
The paper tackles the slow training of Transformers by proposing a multilevel approach that uses an ODE interpretation of the architecture to vary the discretization, and it validates the approach experimentally against the standard training procedure.
In this article, we investigate the potential of multilevel approaches to accelerate the training of Transformer architectures. Using an ordinary differential equation (ODE) interpretation of these architectures, we propose an appropriate way of varying the discretization of these ODE Transformers in order to accelerate training. We validate our approach experimentally by comparing it with the standard training procedure.
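
To make "varying the discretization" concrete, the following is a minimal sketch and not the authors' actual scheme, which the abstract does not detail. It assumes the common residual-network reading of the ODE interpretation, where each layer is one forward Euler step x_{n+1} = x_n + h f(x_n, theta_n) with step size h = T/N. It also assumes that a coarse network is refined by copying each layer's parameters to two consecutive finer layers. The names `layer_fn`, `forward`, and `prolong` are hypothetical, and a single tanh layer stands in for a real Transformer block.

```python
import numpy as np

def layer_fn(x, theta):
    # Hypothetical stand-in for a Transformer block (attention + MLP);
    # a single tanh layer keeps the sketch short.
    W, b = theta
    return np.tanh(x @ W + b)

def forward(x, thetas, T=1.0):
    # Forward Euler in depth: x_{n+1} = x_n + h * f(x_n, theta_n), h = T / N.
    h = T / len(thetas)
    for theta in thetas:
        x = x + h * layer_fn(x, theta)
    return x

def prolong(thetas):
    # Assumed refinement rule: piecewise-constant interpolation in depth,
    # so each coarse layer's parameters initialize two fine layers.
    return [(W.copy(), b.copy()) for (W, b) in thetas for _ in range(2)]

rng = np.random.default_rng(0)
d = 4
# Coarse level: 3 layers. Fine level: 6 layers over the same interval [0, T].
coarse = [(0.1 * rng.standard_normal((d, d)), np.zeros(d)) for _ in range(3)]
fine = prolong(coarse)

x0 = rng.standard_normal((2, d))
y_coarse = forward(x0, coarse)
y_fine = forward(x0, fine)
```

Under these assumptions the fine network has twice as many layers but half the step size, so it discretizes the same ODE and starts close to the coarse network's output. A multilevel schedule could therefore spend early iterations on the cheaper coarse network and continue on the refined one.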