CL | May 31, 2022

NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

arXiv:2205.15960v2 | 304 citations | h-index: 69
Originality: Synthesis-oriented
AI Analysis

This work addresses data scarcity for underrepresented languages in Indonesia, a linguistically diverse country where many languages are endangered. The contribution is incremental in that it focuses on resource creation rather than novel methods.

The authors tackle the lack of NLP resources for low-resource languages in Indonesia by creating NusaX, the first parallel resource for 10 Indonesian local languages. It comprises datasets, a multi-task benchmark, and lexicons, and is intended to enable research and development in this area.

Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Despite being the second most linguistically diverse country, most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes datasets, a multi-task benchmark, and lexicons, as well as a parallel Indonesian-English dataset. We provide extensive analyses and describe the challenges when creating such resources. We hope that our work can spark NLP research on Indonesian and other underrepresented languages.

Code Implementations: 2 repos
