The Zeno's Paradox of `Low-Resource' Languages
This work addresses unclear terminology in NLP research, but its contribution is incremental: it primarily analyzes existing literature and proposes no new methods.
The paper tackles the problem of inconsistent definitions of `low-resource languages' in NLP by qualitatively analyzing 150 papers, showing that multiple interacting axes contribute to low-resourcedness and make it difficult to track progress for individual languages.
The disparity in the languages commonly studied in Natural Language Processing (NLP) is typically reflected by referring to languages as low- vs. high-resourced. However, there is limited consensus on what exactly qualifies as a `low-resource language.' To understand how NLP papers define and study `low-resource' languages, we qualitatively analyzed 150 papers from the ACL Anthology and popular speech-processing conferences that mention the keyword `low-resource.' Based on our analysis, we show how several interacting axes contribute to the `low-resourcedness' of a language and why that makes it difficult to track progress for each individual language. We hope our work (1) elicits explicit definitions of the terminology when it is used in papers and (2) provides grounding for the different axes to consider when labeling a language as low-resource.