CL, Nov 21, 2023

IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages

Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata, Pascale Fung, Ayu Purwarianti

arXiv:2311.12405v1, 29.4304 citations, h-index: 44

Originality: Synthesis-oriented

AI Analysis

This work addresses robustness in NLP for Indonesian code-mixed languages, which is an incremental domain-specific problem.

The paper addresses the limited exploration of code-mixing in Indonesian NLP. It studies robustness against four embedded languages (English, Sundanese, Javanese, and Malay) and introduces IndoRobusta, a framework for evaluating and improving code-mixing robustness. The analysis indicates that bias in the pre-training corpus helps models handle Indonesian-English code-mixing better than code-mixing with the other local languages, despite higher language diversity.

Significant progress has been made on Indonesian NLP. Nevertheless, exploration of the code-mixing phenomenon in Indonesian is limited, despite many languages being frequently mixed with Indonesian in daily conversation. In this work, we explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay; and introduce IndoRobusta, a framework to evaluate and improve the code-mixing robustness. Our analysis shows that the pre-training corpus bias affects the model's ability to better handle Indonesian-English code-mixing when compared to other local languages, despite having higher language diversity.
