CL · AI · Sep 12, 2020

Improving Indonesian Text Classification Using Multilingual Language Model

arXiv:2009.05713v1 · 13 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of low-resource language processing for Indonesian text classification tasks, but it is incremental as it applies existing multilingual models to a specific domain.

The paper tackled the problem of limited labeled data for Indonesian text classification by investigating the effect of combining English and Indonesian data using multilingual language models, finding that adding English data improves performance, especially when Indonesian data is scarce.

Compared to English, the amount of labeled data for Indonesian text classification tasks is very small. Recently developed multilingual language models have shown their ability to create multilingual representations effectively. This paper investigates the effect of combining English and Indonesian data when building Indonesian text classification models (e.g., for sentiment analysis and hate speech detection) using multilingual language models. Using the feature-based approach, we observe performance across various Indonesian data sizes and amounts of added English data. The experiments show that adding English data improves performance, especially when the amount of Indonesian data is small. Using the fine-tuning approach, we further show the effectiveness of utilizing English data to build Indonesian text classification models.
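
The data-mixing setup described in the abstract (a fixed amount of Indonesian data plus a varying amount of added English data) can be sketched as follows. This is a minimal illustration only, not the paper's protocol: the function name `mix_training_data`, the `english_ratio` parameter, and the toy sentences are all assumptions made for this example, and the multilingual encoder and classifier that would consume the mixed set are not shown.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (text, label); 1 = positive, 0 = negative

# Made-up toy data for illustration; not taken from the paper's datasets.
INDONESIAN: List[Example] = [
    ("Filmnya bagus sekali", 1),
    ("Pelayanannya buruk", 0),
    ("Makanannya enak", 1),
]
ENGLISH: List[Example] = [
    ("The movie was great", 1),
    ("The service was terrible", 0),
    ("I loved the food", 1),
    ("This product is awful", 0),
]


def mix_training_data(
    indonesian: List[Example],
    english: List[Example],
    n_indonesian: int,
    english_ratio: float,
    seed: int = 0,
) -> List[Example]:
    """Sample n_indonesian Indonesian examples and add roughly
    english_ratio * n_indonesian English examples (hypothetical scheme)."""
    if n_indonesian > len(indonesian):
        raise ValueError("not enough Indonesian examples")
    rng = random.Random(seed)
    # Cap the English sample at the number of English examples available.
    n_english = min(int(round(english_ratio * n_indonesian)), len(english))
    mixed = rng.sample(indonesian, n_indonesian) + rng.sample(english, n_english)
    rng.shuffle(mixed)  # interleave the two languages
    return mixed


if __name__ == "__main__":
    mixed = mix_training_data(INDONESIAN, ENGLISH, n_indonesian=2, english_ratio=1.5)
    print(len(mixed))  # 2 Indonesian + 3 English examples
```

Sweeping `n_indonesian` and `english_ratio` over a grid would give the kind of comparison the abstract describes, with each mixed set passed to a multilingual language model for training and evaluation on held-out Indonesian data.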

Code Implementations: 1 repo
