Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model
The paper addresses inefficient pretraining for NLP practitioners by offering a more resource-efficient method, though the contribution is incremental because it builds on existing pretraining paradigms.
The paper tackles the high computational and energy costs of large-scale pretraining by proposing Influence Subset Selection (ISS), which uses end-task knowledge to select a tiny subset of the pretraining corpus. With only 0.45% of the data and a three-orders-of-magnitude lower computational cost, ISS outperforms pretrained models such as RoBERTa on eight datasets.
Pretrained language models have achieved remarkable success in various natural language processing tasks. However, pretraining has recently shifted toward larger models and larger data, which has resulted in significant computational and energy costs. In this paper, we propose Influence Subset Selection (ISS) for language models, which explicitly utilizes end-task knowledge to select a tiny subset of the pretraining corpus. Specifically, ISS selects the samples that will provide the most positive influence on the performance of the end task. Furthermore, we design a gradient-matching-based influence estimation method, which can drastically reduce the time needed to compute influence. With only 0.45% of the data and a three-orders-of-magnitude lower computational cost, ISS outperformed pretrained models (e.g., RoBERTa) on eight datasets covering four domains.
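The abstract does not give the scoring rule, so the following is only a minimal sketch of one common first-order form of gradient matching, not the paper's implementation. It assumes a candidate's influence can be approximated by the dot product between its loss gradient and the average end-task gradient: a positive score means a gradient step on that candidate would, to first order, lower the end-task loss. The linear model, squared loss, and all names (`sample_gradient`, `mean_gradient`, `select_influential`) are hypothetical stand-ins for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Example = Tuple[Vector, float]  # (features, target)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sample_gradient(w: Vector, example: Example) -> Vector:
    """Gradient of 0.5 * (w.x - y)^2 with respect to w (toy stand-in loss)."""
    x, y = example
    residual = dot(w, x) - y
    return [residual * xi for xi in x]


def mean_gradient(w: Vector, examples: Sequence[Example]) -> Vector:
    """Average per-sample gradient over a set of examples."""
    grads = [sample_gradient(w, ex) for ex in examples]
    return [sum(col) / len(grads) for col in zip(*grads)]


def select_influential(
    w: Vector,
    candidates: Sequence[Example],
    task_examples: Sequence[Example],
    k: int,
) -> Tuple[List[int], List[float]]:
    """Return indices of the k candidates whose gradients best match the
    end-task gradient, together with every candidate's score.

    Assumption: influence ~ dot(candidate gradient, end-task gradient).
    """
    task_grad = mean_gradient(w, task_examples)
    scores = [dot(sample_gradient(w, c), task_grad) for c in candidates]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Toy usage: three candidate "pretraining" samples, two end-task samples.
w = [0.0, 0.0]
task_examples = [([1.0, 0.0], 1.0), ([2.0, 0.0], 2.0)]
candidates = [
    ([1.0, 0.0], 1.0),   # gradient aligned with the end task
    ([0.0, 1.0], 1.0),   # gradient orthogonal to the end task
    ([1.0, 0.0], -1.0),  # gradient opposed to the end task
]
selected, scores = select_influential(w, candidates, task_examples, k=2)
```

In this toy setup the aligned candidate ranks first and the opposed candidate is excluded. A real language model would need per-sample gradients from the network itself, typically restricted to a subset of parameters to keep the cost manageable; that restriction is likewise an assumption and is not stated in the abstract.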