Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting
This work addresses performance degradation during transfer learning, a problem relevant to NLP practitioners, though the contribution is incremental because it builds on existing fine-tuning paradigms.
The paper tackles catastrophic forgetting in fine-tuning pretrained language models by introducing a recall and learn mechanism that jointly learns pretraining and downstream tasks, achieving state-of-the-art performance on the GLUE benchmark and enabling BERT-base to outperform BERT-large.
Deep pretrained language models have achieved great success through the paradigm of pretraining first and then fine-tuning. But such a sequential transfer learning paradigm often suffers from catastrophic forgetting, which leads to sub-optimal performance. To fine-tune with less forgetting, we propose a recall and learn mechanism, which adopts the idea of multi-task learning and jointly learns pretraining tasks and downstream tasks. Specifically, we propose a Pretraining Simulation mechanism to recall the knowledge from pretraining tasks without data, and an Objective Shifting mechanism to gradually focus the learning on downstream tasks. Experiments show that our method achieves state-of-the-art performance on the GLUE benchmark. Our method also enables BERT-base to achieve better performance than direct fine-tuning of BERT-large. Further, we provide the open-source RecAdam optimizer, which integrates the proposed mechanisms into the Adam optimizer, to facilitate work in the NLP community.
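
The abstract names the two mechanisms but gives no formulas, so the following is only a minimal sketch of how such an objective could be assembled. It assumes that Pretraining Simulation can be approximated by a quadratic penalty that pulls parameters toward their pretrained values (which needs no pretraining data), and that Objective Shifting can be modeled as a sigmoid weight that moves from the penalty to the downstream loss as training proceeds. The function names, the parameters `k`, `t0`, and `gamma`, and their default values are hypothetical and are not taken from the RecAdam release.

```python
import math


def annealing_weight(step, k=0.1, t0=250):
    """Sigmoid weight on the downstream loss (assumed schedule).

    Close to 0 early in training, exactly 0.5 at step == t0, close to 1 later.
    Intended for non-negative step counts.
    """
    return 1.0 / (1.0 + math.exp(-k * (step - t0)))


def recall_penalty(params, pretrained_params, gamma=5000.0):
    """Quadratic pull toward the pretrained parameters (assumed data-free proxy)."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(params, pretrained_params))
    return 0.5 * gamma * squared_distance


def combined_loss(task_loss, params, pretrained_params, step,
                  k=0.1, t0=250, gamma=5000.0):
    """Blend the downstream loss and the recall penalty with the annealed weight."""
    lam = annealing_weight(step, k, t0)
    penalty = recall_penalty(params, pretrained_params, gamma)
    return lam * task_loss + (1.0 - lam) * penalty
```

Under these assumptions, early steps are dominated by the penalty, so the model stays near its pretrained weights, while later steps are dominated by the downstream loss. Here `t0` sets the midpoint of the shift and `k` sets how abrupt it is.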