One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia
This paper addresses linguistic diversity and NLP accessibility for underrepresented languages, particularly in Indonesia, but its contribution is incremental: it reviews existing challenges and offers recommendations without presenting new empirical results.
The paper tackles the limited NLP resources and awareness available for underrepresented languages and dialects. Focusing on Indonesia's 700+ languages, it provides an overview of current research, the main challenges, and general recommendations for improving NLP technology for these languages.
NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also other underrepresented languages.