CL LG · Oct 8, 2020

On the importance of pre-training data volume for compact language models

Vincent Micheli, Martin d'Hoffschmidt, François Fleuret

arXiv:2010.03813v2

Originality: Synthesis-oriented

AI Analysis

This work targets resource efficiency, and thus more sustainable AI, by studying how much pre-training data compact language models actually need. The contribution is incremental, since it builds on existing BERT methods.

The study investigates how pre-training data volume affects compact language models. It finds that BERT-based models perform well on French question answering (FQuAD) with as little as 100 MB of pre-training text, and that an intermediate pre-training step on the task-specific corpus yields no substantial improvement once the pre-training data exceeds a critically low volume.

Recent advances in language modeling have led to computationally intensive and resource-demanding state-of-the-art models. In an effort towards sustainable practices, we study the impact of pre-training data volume on compact language models. Multiple BERT-based models are trained on gradually increasing amounts of French text. Through fine-tuning on the French Question Answering Dataset (FQuAD), we observe that well-performing models are obtained with as little as 100 MB of text. In addition, we show that past critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.
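The abstract's setup of training on "gradually increasing amounts" of text can be illustrated with a minimal sketch that cuts one corpus into nested subsets under a byte budget. The abstract does not say how the subsets were sampled, so the prefix-of-a-shuffle strategy, the function name `nested_subsets`, and the toy corpus below are all assumptions made for illustration, not the authors' procedure.

```python
import random


def nested_subsets(lines, budgets_bytes, seed=0):
    """Return {budget: lines}, where each smaller subset is a prefix of the larger ones.

    Hypothetical helper: sizes are measured in UTF-8 bytes, and nesting comes
    from taking prefixes of a single seeded shuffle.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)

    subsets = {}
    for budget in sorted(budgets_bytes):
        chosen, used = [], 0
        for line in shuffled:
            size = len(line.encode("utf-8"))
            if used + size > budget:
                break  # stop at the first line that would exceed the budget
            chosen.append(line)
            used += size
        subsets[budget] = chosen
    return subsets


# Toy corpus standing in for a large French text dump (made-up data).
corpus = [f"Phrase d'exemple numéro {i}." for i in range(1000)]

# Budgets are in bytes and tiny on purpose; the abstract reports
# well-performing models from about 100 MB of text.
budgets = [1_000, 5_000, 20_000]

subsets = nested_subsets(corpus, budgets)
for budget, chosen in subsets.items():
    total = sum(len(s.encode("utf-8")) for s in chosen)
    print(f"budget={budget} lines={len(chosen)} bytes={total}")
```

Keeping the subsets nested means every larger corpus contains the smaller ones, so differences between runs reflect added data rather than a different sample. Whether the paper enforced this property is not stated in the abstract.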
