Automatic Document Selection for Efficient Encoder Pretraining
The work addresses pretraining efficiency for NLP researchers and practitioners by reducing computational cost, though the contribution is incremental because it extends an existing method.
The paper tackles the expense and data intensity of language model pretraining by proposing automatic document selection, which identifies smaller, domain-representative subsets of a large corpus. The resulting method outperforms random selection while using 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost.
Building pretrained language models is considered expensive and data-intensive, but must we increase dataset size to achieve better performance? We propose an alternative to larger training sets by automatically identifying smaller yet domain-representative subsets. We extend Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target domain corpus. As an example, we treat the OntoNotes corpus as a target domain and pretrain a RoBERTa-like encoder from a cynically selected subset of the Pile. On both perplexity and several downstream tasks in the target domain, it consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost, validating the recipe of automatic document selection for LM pretraining.
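To make the idea of selection conditioned on a target corpus concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified greedy criterion: at each step, add the pool document that most lowers the cross-entropy of the target corpus under an add-alpha-smoothed unigram model of the documents selected so far. The function names (`cross_entropy`, `greedy_select`), the whitespace tokenization, the smoothing constant, and the toy data are all illustrative assumptions. The actual Cynical Data Selection algorithm uses an incremental scoring scheme that avoids the exhaustive recomputation shown here.

```python
import math
from collections import Counter


def cross_entropy(target_counts, selected_counts, vocab_size, alpha=1.0):
    """Per-token cross-entropy (nats) of the target corpus under an
    add-alpha-smoothed unigram model estimated from selected_counts."""
    target_total = sum(target_counts.values())
    denom = sum(selected_counts.values()) + alpha * vocab_size
    h = 0.0
    for tok, count in target_counts.items():
        p = (selected_counts.get(tok, 0) + alpha) / denom
        h -= (count / target_total) * math.log(p)
    return h


def greedy_select(target_docs, pool_docs, budget):
    """Greedily pick up to `budget` pool documents (returned as indices),
    each time choosing the one that yields the lowest target cross-entropy.
    Simplified illustration: rescoring every candidate at every step is
    quadratic in the pool size and would not scale to a corpus like the Pile."""
    # Assumption: whitespace tokenization stands in for a real tokenizer.
    target_counts = Counter(tok for doc in target_docs for tok in doc.split())
    pool_counts = [Counter(doc.split()) for doc in pool_docs]

    vocab = set(target_counts)
    for counts in pool_counts:
        vocab.update(counts)
    vocab_size = len(vocab)

    selected = []
    selected_counts = Counter()
    remaining = set(range(len(pool_docs)))
    for _ in range(min(budget, len(pool_docs))):
        best_idx, best_h = None, float("inf")
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            h = cross_entropy(target_counts, selected_counts + pool_counts[i], vocab_size)
            if h < best_h:
                best_idx, best_h = i, h
        selected.append(best_idx)
        selected_counts += pool_counts[best_idx]
        remaining.remove(best_idx)
    return selected


# Toy data (hypothetical): a news-like target and a mixed general pool.
target = [
    "the senator said the bill passed",
    "officials said talks continue",
]
pool = [
    "def foo ( x ) : return x",
    "the senator said officials passed the bill",
    "lorem ipsum dolor sit amet",
    "talks continue officials said",
]
picked = greedy_select(target, pool, budget=2)
print(picked)
```

On this toy pool the sketch selects the two news-like documents and skips the code and filler text, which is the behavior the abstract describes at corpus scale: a small subset chosen for how well it models the target domain rather than for its size.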