CL · Mar 13, 2024

Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages

arXiv:2403.08693v1 · 84 citations · h-index: 10 · LREC
Originality: Synthesis-oriented
AI Analysis

This work addresses data quality for researchers and practitioners who train language models on web-crawled corpora. Its contribution is incremental: it applies existing evaluation methods rather than introducing new ones.

The study evaluated four large web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across 11 lower-resourced European languages. Human evaluation rated MaCoCu and OSCAR highest for intrinsic quality, yet LMs trained on CC100 scored highest on downstream tasks, which suggests that corpus quality may not significantly affect LM training in these settings.

Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant large, web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across eleven lower-resourced European languages. Our approach is two-fold: first, we perform an intrinsic evaluation by performing a human evaluation of the quality of samples taken from different corpora; then, we assess the practical impact of the qualitative differences by training specific LMs on each of the corpora and evaluating their performance on downstream tasks. We find that there are clear differences in quality of the corpora, with MaCoCu and OSCAR obtaining the best results. However, during the extrinsic evaluation, we actually find that the CC100 corpus achieves the highest scores. We conclude that, in our experiments, the quality of the web-crawled corpora does not seem to play a significant role when training LMs.
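The intrinsic evaluation described above depends on drawing comparable samples from each corpus for human annotators. The abstract does not say how the samples were drawn, so the following is only a minimal sketch under an assumed procedure: uniform reservoir sampling of a fixed number of texts per corpus. The function names, the sample size `k`, and the toy input strings are all hypothetical and not taken from the paper.

```python
import random
from typing import Dict, Iterable, List


def reservoir_sample(items: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Draw up to k items uniformly at random from a stream of unknown length.

    Assumption: uniform sampling; the paper's actual procedure may differ.
    """
    rng = random.Random(seed)
    sample: List[str] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def sample_for_annotation(
    corpora: Dict[str, Iterable[str]], k: int, seed: int = 0
) -> Dict[str, List[str]]:
    """Draw the same number of texts from every corpus (hypothetical helper)."""
    return {name: reservoir_sample(texts, k, seed) for name, texts in corpora.items()}


if __name__ == "__main__":
    # Placeholder strings only; real inputs would be streams of corpus texts.
    toy = {
        "CC100": ["text a", "text b", "text c"],
        "OSCAR": ["text d", "text e"],
    }
    for name, texts in sample_for_annotation(toy, k=2).items():
        print(name, texts)
```

Reservoir sampling is used here because web-crawled corpora are usually too large to load into memory, and it needs only one pass over each stream. Fixing the seed keeps the annotation sample reproducible.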

