CL · Dec 15, 2022

The Effects of In-domain Corpus Size on pre-training BERT

arXiv:2212.07914v1 · 10 citations · h-index: 3
Originality: Synthesis-oriented
AI Analysis

This paper addresses the difficulty of collecting in-domain data that researchers in biomedical NLP face, showing that even small datasets can be effective. The contribution is incremental, since it builds on existing pre-training methods.

The study investigated how the size of in-domain biomedical corpora affects BERT pre-training, finding that using a relatively small amount (4GB) of in-domain data with limited training steps improves performance on downstream domain-specific NLP tasks compared to fine-tuning models pre-trained on general corpora.

Many prior language modeling efforts have shown that pre-training on an in-domain corpus can significantly improve performance on downstream domain-specific NLP tasks. However, the difficulties associated with collecting enough in-domain data might discourage researchers from approaching this pre-training task. In this paper, we conducted a series of experiments by pre-training Bidirectional Encoder Representations from Transformers (BERT) with different sizes of biomedical corpora. The results demonstrate that pre-training on a relatively small amount of in-domain data (4GB) with limited training steps can lead to better performance on downstream domain-specific NLP tasks than fine-tuning models pre-trained on general corpora.
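The abstract describes pre-training on biomedical corpora of different sizes but does not say here how those subsets were built. As a rough illustration only, the sketch below shows one way such size-limited subsets could be cut from a list of documents by UTF-8 byte count. The function names `take_up_to_bytes` and `nested_subsets` are hypothetical, and the prefix-based sampling is an assumption, not the authors' procedure.

```python
from typing import Dict, Iterable, Iterator, List


def take_up_to_bytes(docs: Iterable[str], budget_bytes: int) -> Iterator[str]:
    """Yield documents in order, stopping before the byte budget is exceeded.

    Size is measured as the UTF-8 encoded length of each document.
    (Hypothetical helper; prefix selection is an assumption.)
    """
    used = 0
    for doc in docs:
        size = len(doc.encode("utf-8"))
        if used + size > budget_bytes:
            break
        used += size
        yield doc


def nested_subsets(docs: List[str], budgets: List[int]) -> Dict[int, List[str]]:
    """Build one subset per byte budget.

    Because every subset is a prefix of the same document list,
    smaller subsets are contained in larger ones.
    """
    return {b: list(take_up_to_bytes(docs, b)) for b in sorted(budgets)}


if __name__ == "__main__":
    # Toy corpus; a real run would stream documents from disk and use
    # budgets such as 4 * 1024**3 bytes for a 4GB subset.
    corpus = ["aaaa", "bbbb", "cccc"]
    for budget, subset in nested_subsets(corpus, [4, 8, 100]).items():
        print(budget, subset)
```

Each resulting subset would then be fed to a separate pre-training run, so that downstream scores can be compared across corpus sizes. Only the 4GB figure comes from the abstract; any other budget is a placeholder.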

Code Implementations: 1 repo
