CL · Sep 19, 2020

Towards Computational Linguistics in Minangkabau Language: Studies on Sentiment Analysis and Machine Translation

arXiv:2009.09309v1737 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses resource scarcity for low-resource languages such as Minangkabau and enables computational linguistics research on the language, though it is incremental in that it applies existing methods to new data.

The authors address the lack of computational resources for the Minangkabau language by releasing sentiment analysis and machine translation corpora built from Twitter and Wikipedia. They find that classifiers trained on Indonesian perform poorly on Minangkabau text, and that a simple word-to-word translation method using a bilingual dictionary outperforms LSTM and Transformer models in BLEU score.

Although some linguists (Rusmali et al., 1985; Crouch, 2009) have attempted to define the morphology and syntax of Minangkabau, information processing in this language is still absent due to the scarcity of annotated resources. In this work, we release two Minangkabau corpora, one for sentiment analysis and one for machine translation, harvested and constructed from Twitter and Wikipedia. We conduct the first computational linguistics study of the Minangkabau language, employing classic machine learning and sequence-to-sequence models such as LSTM and Transformer. Our first experiments show that classification performance on Minangkabau text drops significantly when tested with a model trained on Indonesian. In the machine translation experiment, by contrast, a simple word-to-word translation using a bilingual dictionary outperforms the LSTM and Transformer models in terms of BLEU score.
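
The word-to-word baseline mentioned in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation: the function name `translate_word_by_word`, the toy dictionary entries, the lowercasing and whitespace tokenization, and the choice to copy unknown words through unchanged are all assumptions made for illustration.

```python
def translate_word_by_word(sentence, dictionary):
    """Translate by replacing each token with its dictionary entry.

    Assumption: tokens are lowercased and split on whitespace, and tokens
    missing from the dictionary are copied to the output unchanged.
    """
    tokens = sentence.lower().split()
    return " ".join(dictionary.get(token, token) for token in tokens)


# Hypothetical toy Minangkabau -> Indonesian entries, for illustration only;
# they are not taken from the bilingual dictionary used in the paper.
toy_dictionary = {
    "ambo": "saya",
    "indak": "tidak",
    "pai": "pergi",
    "ka": "ke",
    "pasa": "pasar",
}

print(translate_word_by_word("Ambo indak pai ka pasa", toy_dictionary))
```

Because Minangkabau and Indonesian share much of their word order, a lookup of this kind can already produce output with high n-gram overlap against a reference, which may help explain why such a baseline scores well on BLEU.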
