How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP
This paper addresses data quality issues facing researchers and practitioners who use Wikipedia for low-resource and multilingual NLP. Its contribution is incremental, however, because it builds on existing quality filtering methods.
The paper tackles the problem of Wikipedia data quality in low-resource and multilingual NLP by auditing Wikipedia with quality filtering techniques. The audit reveals issues such as one-line and duplicate articles, and the authors find that pruning improves resource efficiency without hurting performance.
Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing widespread issues such as a high percentage of one-line articles and duplicate articles. We evaluate the downstream impact of quality filtering on Wikipedia and find that data quality pruning is an effective means for resource-efficient training without hurting performance, especially for low-resource languages. Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.
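As a rough illustration of the kind of audit the abstract describes, the following minimal sketch flags one-line articles and exact duplicates in a small in-memory collection. It is not the paper's pipeline: the function names `audit_articles` and `prune`, the input layout (a dict from article ID to plain text), the "at most one non-empty line" criterion, and the whitespace- and case-normalised hashing for duplicate detection are all assumptions made for the example.

```python
import hashlib


def audit_articles(articles):
    """Flag one-line and exact-duplicate articles.

    `articles` is a hypothetical mapping from article ID to plain text.
    Returns (one_line_ids, duplicate_ids) in iteration order.
    """
    one_line = []
    duplicates = []
    seen = {}  # content digest -> first article ID with that content

    for article_id, text in articles.items():
        # Assumption: an article with at most one non-empty line
        # (including an empty article) counts as "one-line".
        non_empty = [ln for ln in text.splitlines() if ln.strip()]
        if len(non_empty) <= 1:
            one_line.append(article_id)

        # Assumption: duplicates are detected by hashing the text after
        # collapsing whitespace and lowercasing; later copies are flagged.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append(article_id)
        else:
            seen[digest] = article_id

    return one_line, duplicates


def prune(articles):
    """Return a copy of `articles` without the flagged entries."""
    one_line, duplicates = audit_articles(articles)
    drop = set(one_line) | set(duplicates)
    return {k: v for k, v in articles.items() if k not in drop}
```

Under these assumptions, `prune` keeps the first copy of any duplicated article and drops every one-line article. Whether such thresholds suit a given language or task is exactly the kind of language- and task-specific choice the abstract argues for.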