CL · LG · May 20, 2023

Lifelong Language Pretraining with Distribution-Specialized Experts

arXiv:2305.12281v1 · 89 citations
Originality: Incremental advance
AI Analysis

This paper addresses lifelong learning for language models: adapting to new data distributions without forgetting what was learned earlier. The advance is incremental because it builds on prior MoE and regularization techniques.

The paper tackles the problem of adapting large language models to new data distributions without catastrophic forgetting. It proposes Lifelong-MoE, an extensible Mixture-of-Experts architecture that dynamically adds experts and trains them with regularized pretraining. The results show better few-shot performance on 19 downstream NLP tasks than existing lifelong learning methods, while computation cost stays constant.

Pretraining on a large-scale corpus has become a standard method to build general language models (LMs). Adapting a model to new data distributions targeting different downstream tasks poses significant challenges. Naive fine-tuning may incur catastrophic forgetting when the over-parameterized LMs overfit the new data but fail to preserve the pretrained features. Lifelong learning (LLL) aims to enable information systems to learn from a continuous data stream across time. However, most prior work modifies the training recipe assuming a static fixed network architecture. We find that additional model capacity and proper regularization are key elements to achieving strong LLL performance. Thus, we propose Lifelong-MoE, an extensible MoE (Mixture-of-Experts) architecture that dynamically adds model capacity via adding experts with regularized pretraining. Our results show that by only introducing a limited number of extra experts while keeping the computation cost constant, our model can steadily adapt to data distribution shifts while preserving the previous knowledge. Compared to existing lifelong learning approaches, Lifelong-MoE achieves better few-shot performance on 19 downstream NLP tasks.
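The idea of growing capacity by adding experts while keeping per-input compute fixed can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names `Expert`, `ExpandableMoE`, `add_experts`, and `output_penalty` are hypothetical, each expert is a tiny linear map, routing is top-k softmax gating, older experts are frozen on expansion, and the regularizer is a simple mean squared difference between the outputs of the pre-expansion snapshot and the expanded model.

```python
import copy
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class Expert:
    """Tiny linear expert y = W x; a hypothetical stand-in for a feed-forward expert."""

    def __init__(self, dim, rng):
        self.w = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
        self.frozen = False  # a training loop would skip updates for frozen experts

    def __call__(self, x):
        return [dot(row, x) for row in self.w]


class ExpandableMoE:
    """Assumed design: top-k routing over a list of experts that can grow."""

    def __init__(self, dim, num_experts, top_k=2, seed=0):
        self.dim, self.top_k = dim, top_k
        self.rng = random.Random(seed)
        self.experts = []
        self.gate = []       # one gating row per expert
        self.num_frozen = 0  # experts (and gate rows) kept fixed after expansion
        self._grow(num_experts)

    def _grow(self, n):
        for _ in range(n):
            self.experts.append(Expert(self.dim, self.rng))
            self.gate.append([self.rng.gauss(0.0, 0.1) for _ in range(self.dim)])

    def add_experts(self, n):
        """Freeze everything learned so far, then append n trainable experts."""
        for expert in self.experts:
            expert.frozen = True
        self.num_frozen = len(self.experts)
        self._grow(n)

    def forward(self, x):
        logits = [dot(row, x) for row in self.gate]
        order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
        top = order[: self.top_k]
        peak = max(logits[i] for i in top)
        weights = [math.exp(logits[i] - peak) for i in top]  # softmax over top-k only
        total = sum(weights)
        out = [0.0] * self.dim
        for i, w in zip(top, weights):
            y = self.experts[i](x)
            out = [o + (w / total) * yi for o, yi in zip(out, y)]
        return out, top


def output_penalty(old_out, new_out):
    """Assumed regularizer: mean squared gap between old and new model outputs."""
    return sum((a - b) ** 2 for a, b in zip(old_out, new_out)) / len(old_out)


model = ExpandableMoE(dim=4, num_experts=4)
teacher = copy.deepcopy(model)  # snapshot taken before the distribution shift
model.add_experts(2)            # extra capacity for the new distribution

x = [0.5, -1.0, 0.25, 2.0]
old_out, _ = teacher.forward(x)
new_out, routed = model.forward(x)
reg = output_penalty(old_out, new_out)  # would be added to the pretraining loss
```

Because `top_k` is fixed, the number of experts evaluated per input does not grow when experts are added, which is one simple way to read "keeping the computation cost constant". The penalty term only sketches the role regularization might play in preserving earlier behavior; the paper's actual objective may differ.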

