CL · Nov 14, 2022

High-Resource Methodological Bias in Low-Resource Investigations

Maartje ter Hoeve, David Grangier, Natalie Schluter

arXiv:2211.07534v10.63 citations · h-index: 39

Originality: Incremental advance

AI Analysis

This work addresses a methodological issue for researchers and practitioners in low-resource NLP, highlighting that common evaluation practices may be misleading, though it is incremental in nature.

The paper tackles methodological bias in low-resource NLP by showing that down-sampling high-resource language data yields datasets whose properties differ from those of genuinely low-resource datasets, which biases performance evaluations for POS-tagging and machine translation.

The central bottleneck for low-resource NLP is typically regarded to be the quantity of accessible data, overlooking the contribution of data quality. This is particularly seen in the development and evaluation of low-resource systems via down sampling of high-resource language data. In this work we investigate the validity of this approach, and we specifically focus on two well-known NLP tasks for our empirical investigations: POS-tagging and machine translation. We show that down sampling from a high-resource language results in datasets with different properties than the low-resource datasets, impacting the model performance for both POS-tagging and machine translation. Based on these results we conclude that naive down sampling of datasets results in a biased view of how well these systems work in a low-resource scenario.
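To make the idea of naive down sampling concrete, the following is a minimal Python sketch, not the authors' code. It uniformly samples sentences from a larger corpus and computes a few surface statistics that could be compared against a genuinely low-resource dataset. The function names (`naive_downsample`, `corpus_stats`), the chosen statistics, and the toy corpus are all assumptions made for illustration; the abstract does not specify which dataset properties the paper measures.

```python
import random


def naive_downsample(sentences, n, seed=0):
    """Uniformly sample n tokenized sentences without replacement.

    Hypothetical helper: this is the 'naive' strategy in its simplest form,
    not necessarily the exact procedure used in the paper.
    """
    rng = random.Random(seed)
    return rng.sample(sentences, n)


def corpus_stats(sentences):
    """Compute simple surface statistics for a list of tokenized sentences.

    The statistics chosen here are illustrative assumptions.
    """
    tokens = [tok for sent in sentences for tok in sent]
    vocab = set(tokens)
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "vocab_size": len(vocab),
        "type_token_ratio": len(vocab) / len(tokens) if tokens else 0.0,
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
    }


# Toy stand-in for a high-resource corpus (invented data, for illustration only).
high_resource = [
    ["the", "cat", "sat"],
    ["a", "dog", "ran", "home"],
    ["the", "dog", "sat"],
    ["a", "cat", "ran"],
]

sample = naive_downsample(high_resource, 2, seed=0)
stats = corpus_stats(sample)
print(stats)
```

Running `corpus_stats` on both the down-sampled corpus and a real low-resource corpus of the same size would give a first, rough indication of whether the two differ in the way the abstract describes.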

