CL · AI · Nov 16, 2023

Source Prompt: Coordinated Pre-training of Language Models on Diverse Corpora from Multiple Sources

arXiv:2311.09732v1 · 1 citation · h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in scaling pre-trained language models for NLP by mitigating problems caused by heterogeneous data sources, though the advance is incremental because it builds on existing pre-training paradigms.

The paper tackles pre-training language models on diverse corpora drawn from multiple sources, a practice that can have negative side effects. It proposes source prompts, which explicitly indicate the data source during pre-training and fine-tuning, and reports significant improvements on downstream tasks.

Pre-trained language models (PLMs) have established a new paradigm in the field of NLP. To obtain more powerful PLMs, one of the most popular and successful approaches is to continuously scale up the sizes of the models and the pre-training corpora. These large corpora are generally obtained by merging smaller ones from multiple sources, and they are thus growing increasingly diverse. However, the side effects of these colossal merged corpora remain understudied. In this paper, we identify the disadvantage of heterogeneous corpora from multiple sources for pre-training PLMs. Towards coordinated pre-training on diverse corpora, we further propose source prompts (SP), which explicitly inform the model of the data source at the pre-training and fine-tuning stages. Results of extensive experiments demonstrate that PLMs pre-trained with SP on diverse corpora gain significant improvement in various downstream tasks.
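
As a rough illustration of the source-prompt idea described in the abstract, the sketch below prepends a source marker to each text before it would be passed to a tokenizer. The abstract does not specify the prompt format, so the marker strings, the source names, and the helpers `add_source_prompt` and `build_corpus` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical source names and marker strings (assumptions for illustration;
# the abstract does not say how source prompts are actually represented).
SOURCE_PROMPTS: Dict[str, str] = {
    "wikipedia": "[SOURCE: wikipedia]",
    "news": "[SOURCE: news]",
    "books": "[SOURCE: books]",
}
# Fallback marker for texts whose source is not in the table above.
UNKNOWN_PROMPT = "[SOURCE: unknown]"


def add_source_prompt(text: str, source: str) -> str:
    """Prepend the marker for `source` to `text`, separated by one space."""
    prompt = SOURCE_PROMPTS.get(source, UNKNOWN_PROMPT)
    return f"{prompt} {text}"


def build_corpus(examples: List[Dict[str, str]]) -> List[str]:
    """Apply the source marker to every example in a mixed-source corpus."""
    return [add_source_prompt(ex["text"], ex["source"]) for ex in examples]


if __name__ == "__main__":
    mixed = [
        {"text": "Paris is the capital of France.", "source": "wikipedia"},
        {"text": "Markets closed higher today.", "source": "news"},
    ]
    for line in build_corpus(mixed):
        print(line)
```

Under this assumption, the same marker would be applied at fine-tuning time by choosing the source closest to the downstream task's data. How that choice is made is not stated in the abstract.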

