Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models
This work addresses pre-training efficiency for multilingual NLP models and offers a practical recipe for researchers and practitioners, although the contribution is incremental because it builds on existing pre-training methods.
The paper tackles the computational expense of pre-training both encoder-only and seq2seq models. It proposes a two-stage approach that first trains an MLM encoder and then uses it to warm-start seq2seq training, reducing total compute cost by 27% while matching the performance of training each model from scratch.
Pre-trained encoder-only and sequence-to-sequence (seq2seq) models each have advantages; however, training both model types from scratch is computationally expensive. We explore recipes to improve pre-training efficiency by initializing one model from the other. (1) Extracting the encoder from a seq2seq model, we show that it underperforms a Masked Language Modeling (MLM) encoder, particularly on sequence labeling tasks. Variations of masking during seq2seq training, reducing the decoder size, and continuing with a small amount of MLM training do not close the gap. (2) Conversely, using an encoder to warm-start seq2seq training, we show that by unfreezing the encoder partway through training, we can match the task performance of a from-scratch seq2seq model. Overall, this two-stage approach is an efficient recipe to obtain both a multilingual encoder and a seq2seq model, matching the performance of training each model from scratch while reducing the total compute cost by 27%.
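The warm-start schedule in recipe (2) can be illustrated with a minimal, framework-free sketch. It only models which parts of the seq2seq model receive updates at a given step. The function names and the `unfreeze_fraction` parameter are hypothetical, and the default of 0.5 is a placeholder: the abstract says only that the encoder is unfrozen "partway through training" and does not give the exact point.

```python
def encoder_is_trainable(step: int, total_steps: int,
                         unfreeze_fraction: float = 0.5) -> bool:
    """Return True once training has reached the unfreeze point.

    unfreeze_fraction is an assumed hyperparameter (not from the paper):
    the fraction of seq2seq training during which the warm-started
    encoder stays frozen.
    """
    if not 0.0 <= unfreeze_fraction <= 1.0:
        raise ValueError("unfreeze_fraction must be in [0, 1]")
    unfreeze_step = int(total_steps * unfreeze_fraction)
    return step >= unfreeze_step


def trainable_modules(step: int, total_steps: int,
                      unfreeze_fraction: float = 0.5) -> list[str]:
    """List the modules that are updated at a given training step."""
    # The decoder is newly initialized, so it is trained from step 0.
    modules = ["decoder"]
    # The encoder starts from MLM weights and joins training later.
    if encoder_is_trainable(step, total_steps, unfreeze_fraction):
        modules.insert(0, "encoder")
    return modules


if __name__ == "__main__":
    for step in (0, 49, 50, 99):
        print(step, trainable_modules(step, total_steps=100))
```

In a framework such as PyTorch, the same decision would typically be applied by toggling `requires_grad` on the encoder's parameters at the unfreeze step; that mapping is a suggestion, not a detail taken from the paper.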