CL Jun 9, 2022

Language Identification for Austronesian Languages

arXiv:2206.04327v131.0585 citations | h-index: 15 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses language identification for under-resourced Austronesian languages. The contribution is incremental in that it applies existing methods to new data.

The paper tackles language identification for low-resource Austronesian languages by evaluating six methods, and finds that a classifier based on skip-gram embeddings performs significantly better than the alternatives. Expanding the inventory to 800 languages has only a minimal impact on accuracy for the Austronesian languages, and the adapted models also achieve high accuracy in code-switching detection across all 29 languages.

This paper provides language identification models for low- and under-resourced languages in the Pacific region with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of developing language resources. The approach taken in this paper combines 29 Austronesian languages with 171 non-Austronesian languages to create an evaluation set drawn from eight data sources. After evaluating six approaches to language identification, we find that a classifier based on skip-gram embeddings reaches a significantly higher performance than alternate methods. We then systematically increase the number of non-Austronesian languages in the model up to a total of 800 languages to evaluate whether an increased language inventory leads to less precise predictions for the Austronesian languages of interest. This evaluation finds that there is only a minimal impact on accuracy caused by increasing the inventory of non-Austronesian languages. Further experiments adapt these language identification models for code-switching detection, achieving high accuracy across all 29 languages.
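As a rough illustration of the two tasks described above (language identification, then code-switching detection built on top of it), here is a minimal Python sketch. It is not the paper's implementation: a smoothed character-trigram classifier stands in for the skip-gram embedding classifier, and the class `NgramLangID`, the helper `switch_points`, the window size, and the two toy training samples are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a space-padded, lowercased string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLangID:
    """Toy language identifier: add-one smoothed character n-gram naive Bayes.

    Hypothetical stand-in for the skip-gram embedding classifier in the paper.
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language code -> Counter of n-grams
        self.vocab = set()  # all n-grams seen in training

    def fit(self, samples):
        for lang, text in samples:
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(lang, Counter()).update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        vocab_size = len(self.vocab)

        def log_likelihood(lang):
            counter = self.counts[lang]
            total = sum(counter.values())
            return sum(
                math.log((counter[g] + 1) / (total + vocab_size))
                for g in char_ngrams(text, self.n)
            )

        return max(self.counts, key=log_likelihood)


def switch_points(model, tokens, window=3):
    """Label non-overlapping token windows and return the token indices
    at which the predicted language changes."""
    points, previous = [], None
    for start in range(0, len(tokens), window):
        label = model.predict(" ".join(tokens[start:start + window]))
        if previous is not None and label != previous:
            points.append(start)
        previous = label
    return points


# Toy training data (assumption): one English and one Māori sample.
TOY_SAMPLES = [
    ("eng", "the quick brown fox jumps over the lazy dog and the cat sleeps on the mat"),
    ("mri", "ko te reo maori te reo o aotearoa kei te pehea koe ka kite ano"),
]

model = NgramLangID().fit(TOY_SAMPLES)
print(model.predict("kei te pehea koe"))
print(switch_points(model, "the dog sleeps kei te pehea koe".split()))
```

Adding more non-target languages to `TOY_SAMPLES` and re-checking predictions for the target language mirrors, in miniature, the inventory-expansion experiment the abstract describes. A real system would need far more training text per language.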
