CL · Jun 24, 2022

MVP: Multi-task Supervised Pre-training for Natural Language Generation

Tianyi Tang, Junyi Li, Wayne Xin Zhao, Ji-Rong Wen

arXiv:2206.12131v3 · 22.2237 citations · h-index: 70 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the need for more effective pre-trained models in natural language generation, offering a supervised approach that outperforms existing methods. The advance is incremental, because it builds on prior supervised pre-training and instruction tuning techniques.

The paper tackles the problem of improving natural language generation by proposing Multi-task superVised Pre-training (MVP), which uses a large-scale corpus from 77 datasets across 11 tasks to pre-train a model in a supervised manner, achieving state-of-the-art performance on 13 out of 17 datasets with gains of 9.3% over BART and 5.8% over Flan-T5.

Pre-trained language models (PLMs) have achieved remarkable success in natural language generation (NLG) tasks. Until now, most NLG-oriented PLMs have been pre-trained in an unsupervised manner on large-scale general corpora. Meanwhile, an increasing number of models pre-trained with labeled data (i.e., "supervised pre-training") show superior performance compared to unsupervised pre-trained models. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation. We collect a large-scale natural language generation corpus, MVPCorpus, from 77 datasets over 11 diverse NLG tasks. We then unify these examples into a general text-to-text format to pre-train the text generation model MVP in a supervised manner. For each task, we further pre-train specific soft prompts to stimulate the model's capacity to perform that task. Our MVP model can be seen as an application of recent instruction tuning to relatively small PLMs. Extensive experiments demonstrate the effectiveness and generality of MVP across a number of NLG tasks: it achieves state-of-the-art performance on 13 out of 17 datasets, outperforming BART by 9.3% and Flan-T5 by 5.8%.
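The unification step described in the abstract can be illustrated with a minimal sketch. It assumes each labeled example is mapped to an (input text, output text) pair by prepending a task instruction to the source. The instruction strings, the triple linearization, and the helper names `to_text_to_text` and `linearize_triples` are hypothetical and are not taken from the paper. The task-specific soft prompts are not modeled here.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical instruction strings: the abstract does not list the real ones.
TASK_INSTRUCTIONS: Dict[str, str] = {
    "summarization": "Summarize:",
    "data_to_text": "Describe the following data:",
}


def linearize_triples(triples: Iterable[Tuple[str, str, str]]) -> str:
    """Flatten (subject, relation, object) triples into one string (assumed format)."""
    return " [SEP] ".join(" | ".join(triple) for triple in triples)


def to_text_to_text(task: str, source: str, target: str) -> Tuple[str, str]:
    """Map one labeled example to an (input text, output text) pair."""
    instruction = TASK_INSTRUCTIONS[task]
    return f"{instruction} {source.strip()}", target.strip()


# Examples from different tasks end up in the same two-string format.
examples = [
    to_text_to_text("summarization", "  A long news article ...", "A short summary."),
    to_text_to_text(
        "data_to_text",
        linearize_triples([("Iwo Jima", "location", "Japan")]),
        "Iwo Jima is located in Japan.",
    ),
]
```

Once every dataset is expressed as such pairs, a single sequence-to-sequence model can in principle be trained on the mixture with an ordinary supervised objective, regardless of the original task.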
