CLNov 2, 2023Code
COPAL-ID: Indonesian Language Reasoning with Local Culture and NuancesHaryo Akbarianto Wibowo, Erland Hilman Fuadi, Made Nindyatama Nityasya et al.
We present COPAL-ID, a novel, public Indonesian language common sense reasoning dataset. Unlike the previous Indonesian COPA dataset (XCOPA-ID), COPAL-ID incorporates Indonesian local and cultural nuances, and therefore, provides a more natural portrayal of day-to-day causal reasoning within the Indonesian cultural sphere. Professionally written by natives from scratch, COPAL-ID is more fluent and free from awkward phrases, unlike the translated XCOPA-ID. In addition, we present COPAL-ID in both standard Indonesian and in Jakartan Indonesian-a dialect commonly used in daily conversation. COPAL-ID poses a greater challenge for existing open-sourced and closed state-of-the-art multilingual language models, yet is trivially easy for humans. Our findings suggest that general multilingual models struggle to perform well, achieving 66.91% accuracy on COPAL-ID. South-East Asian-specific models achieve slightly better performance of 73.88% accuracy. Yet, this number still falls short of near-perfect human performance. This shows that these language models are still way behind in comprehending the local nuances of Indonesian.
CLMar 24, 2022
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in IndonesiaAlham Fikri Aji, Genta Indra Winata, Fajri Koto et al.
NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also other underrepresented languages.
CLMay 31, 2022
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local LanguagesGenta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya et al.
Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Despite being the second most linguistically diverse country, most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes datasets, a multi-task benchmark, and lexicons, as well as a parallel Indonesian-English dataset. We provide extensive analyses and describe the challenges when creating such resources. We hope that our work can spark NLP research on Indonesian and other underrepresented languages.
CLJun 5, 2023
On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training ResearchMade Nindyatama Nityasya, Haryo Akbarianto Wibowo, Alham Fikri Aji et al.
This evidence-based position paper critiques current research practices within the language model pre-training literature. Despite rapid recent progress afforded by increasingly better pre-trained language models (PLMs), current PLM research practices often conflate different possible sources of model improvement, without conducting proper ablation studies and principled comparisons between different models under comparable conditions. These practices (i) leave us ill-equipped to understand which pre-training approaches should be used under what circumstances; (ii) impede reproducibility and credit assignment; and (iii) render it difficult to understand: "How exactly does each factor contribute to the progress that we have today?" We provide a case in point by revisiting the success of BERT over its baselines, ELMo and GPT-1, and demonstrate how -- under comparable conditions where the baselines are tuned to a similar extent -- these baselines (and even-simpler variants thereof) can, in fact, achieve competitive or better performance than BERT. These findings demonstrate how disentangling different factors of model improvements can lead to valuable new insights. We conclude with recommendations for how to encourage and incentivize this line of work, and accelerate progress towards a better and more systematic understanding of what factors drive the progress of our foundation models today.
CLMay 10, 2022
ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample PairAlham Fikri Aji, Tirana Noor Fatyanosa, Radityo Eko Prasojo et al.
We release our synthetic parallel paraphrase corpus across 17 languages: Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi, Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and Chinese. Our method relies only on monolingual data and a neural machine translation system to generate paraphrases, hence simple to apply. We generate multiple translation samples using beam search and choose the most lexically diverse pair according to their sentence BLEU. We compare our generated corpus with the \texttt{ParaBank2}. According to our evaluation, our synthetic paraphrase pairs are semantically similar and lexically diverse.
SDMar 29, 2022
Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise DistillationRendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji et al.
Several solutions for lightweight TTS have shown promising results. Still, they either rely on a hand-crafted design that reaches non-optimum size or use a neural architecture search but often suffer training costs. We present Nix-TTS, a lightweight TTS achieved via knowledge distillation to a high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free) TTS teacher model. Specifically, we offer module-wise distillation, enabling flexible and independent distillation to the encoder and decoder module. The resulting Nix-TTS inherited the advantageous properties of being non-autoregressive and end-to-end from the teacher, yet significantly smaller in size, with only 5.23M parameters or up to 89.34% reduction of the teacher model; it also achieves over 3.04x and 8.36x inference speedup on Intel-i7 CPU and Raspberry Pi 3B respectively and still retains a fair voice naturalness and intelligibility compared to the teacher model. We provide pretrained models and audio samples of Nix-TTS.
83.5CYApr 23
Context-Aware Displacement Estimation from Mobile Phone Data: A Methodological FrameworkRajius Idzalika, Muhammad Rheza Muztahid, Radityo Eko Prasojo
Timely population displacement estimates are critical for humanitarian response during disasters, but traditional surveys and field assessments are slow. Mobile phone data enables near real-time tracking, yet existing approaches apply uniform displacement definitions regardless of individual mobility patterns, misclassifying regular commuters as displaced. We present a methodological framework addressing this through three innovations: (1) mobility profile classification distinguishing local residents from commuter types, (2) context-aware between-municipality displacement detection accounting for expected location by user type and day of week, and (3) operational uncertainty bounds derived from baseline coefficient of variation with a disaster adjustment factor, intended for humanitarian decision support rather than formal statistical inference. The framework produces three complementary metrics scaled to population with uncertainty bounds: displacement rates, origin-destination flows, and return dynamics. An Aparri case study following Super Typhoon Nando (2025, Philippines) applies the framework to vendor-provided daily locations from Globe Telecom. Context-aware detection reduced estimated between-municipality displacement by 1.6-2.7 percentage points on weekdays versus naive methods, attributable to the commuter exception but not independently validated. The method captures between-municipality displacement only. Within-municipality evacuation falls outside scope. The single-case demonstration establishes proof of concept. External validity requires application across multiple events and locations. The framework provides humanitarian actors with operational displacement information while preserving individual privacy through aggregation.
CLMar 19, 2024Code
Investigating Text Shortening Strategy in BERT: Truncation vs SummarizationMirza Alim Mutasodirin, Radityo Eko Prasojo
The parallelism of Transformer-based models comes at the cost of their input max-length. Some studies proposed methods to overcome this limitation, but none of them reported the effectiveness of summarization as an alternative. In this study, we investigate the performance of document truncation and summarization in text classification tasks. Each of the two was investigated with several variations. This study also investigated how close their performances are to the performance of full-text. We used a dataset of summarization tasks based on Indonesian news articles (IndoSum) to do classification tests. This study shows how the summaries outperform the majority of truncation method variations and lose to only one. The best strategy obtained in this study is taking the head of the document. The second is extractive summarization. This study explains what happened to the result, leading to further research in order to exploit the potential of document summarization as a shortening alternative. The code and data used in this work are publicly available in https://github.com/mirzaalimm/TruncationVsSummarization.
CLNov 6, 2020Code
Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-TranslationHaryo Akbarianto Wibowo, Tatag Aziz Prawiro, Muhammad Ihsan et al.
In its daily use, the Indonesian language is riddled with informality, that is, deviations from the standard in terms of vocabulary, spelling, and word order. On the other hand, current available Indonesian NLP models are typically developed with the standard Indonesian in mind. In this work, we address a style-transfer from informal to formal Indonesian as a low-resource machine translation problem. We build a new dataset of parallel sentences of informal Indonesian and its formal counterpart. We benchmark several strategies to perform style transfer from informal to formal Indonesian. We also explore augmenting the training set with artificial forward-translated data. Since we are dealing with an extremely low-resource setting, we find that a phrase-based machine translation approach outperforms the Transformer-based approach. Alternatively, a pre-trained GPT-2 fined-tuned to this task performed equally well but costs more computational resource. Our findings show a promising step towards leveraging machine translation models for style transfer. Our code and data are available in https://github.com/haryoa/stif-indonesia
CLMar 19, 2024
Simple Hack for Transformers against Heavy Long-Text Classification on a Time- and Memory-Limited GPU ServiceMirza Alim Mutasodirin, Radityo Eko Prasojo, Achmad F. Abka et al.
Many NLP researchers rely on free computational services, such as Google Colab, to fine-tune their Transformer models, causing a limitation for hyperparameter optimization (HPO) in long-text classification due to the method having quadratic complexity and needing a bigger resource. In Indonesian, only a few works were found on long-text classification using Transformers. Most only use a small amount of data and do not report any HPO. In this study, using 18k news articles, we investigate which pretrained models are recommended to use based on the output length of the tokenizer. We then compare some hacks to shorten and enrich the sequences, which are the removals of stopwords, punctuation, low-frequency words, and recurring words. To get a fair comparison, we propose and run an efficient and dynamic HPO procedure that can be done gradually on a limited resource and does not require a long-running optimization library. Using the best hack found, we then compare 512, 256, and 128 tokens length. We find that removing stopwords while keeping punctuation and low-frequency words is the best hack. Some of our setups manage to outperform taking 512 first tokens using a smaller 128 or 256 first tokens which manage to represent the same information while requiring less computational resources. The findings could help developers to efficiently pursue optimal performance of the models using limited resources.
CLJan 3, 2022
Which Student is Best? A Comprehensive Knowledge Distillation Exam for Task-Specific BERT ModelsMade Nindyatama Nityasya, Haryo Akbarianto Wibowo, Rendi Chevi et al.
We perform knowledge distillation (KD) benchmark from task-specific BERT-base teacher models to various student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small. Our experiment involves 12 datasets grouped in two tasks: text classification and sequence labeling in the Indonesian language. We also compare various aspects of distillations including the usage of word embeddings and unlabeled data augmentation. Our experiments show that, despite the rising popularity of Transformer-based models, using BiLSTM and CNN student models provide the best trade-off between performance and computational resource (CPU, RAM, and storage) compared to pruned BERT models. We further propose some quick wins on performing KD to produce small NLP models via efficient KD training mechanisms involving simple choices of loss functions, word embeddings, and unlabeled data preparation.
CLDec 30, 2020
Synthetic Source Language Augmentation for Colloquial Neural Machine TranslationAsrul Sani Ariesandy, Mukhlis Amien, Alham Fikri Aji et al.
Neural machine translation (NMT) is typically domain-dependent and style-dependent, and it requires lots of training data. State-of-the-art NMT models often fall short in handling colloquial variations of its source language and the lack of parallel data in this regard is a challenging hurdle in systematically improving the existing models. In this work, we develop a novel colloquial Indonesian-English test-set collected from YouTube transcript and Twitter. We perform synthetic style augmentation to the source of formal Indonesian language and show that it improves the baseline Id-En models (in BLEU) over the new test data.
CLDec 16, 2020
Costs to Consider in Adopting NLP for Your BusinessMade Nindyatama Nityasya, Haryo Akbarianto Wibowo, Radityo Eko Prasojo et al.
Recent advances in Natural Language Processing (NLP) have largely pushed deep transformer-based models as the go-to state-of-the-art technique without much regard to the production and utilization cost. Companies planning to adopt these methods into their business face difficulties because of the lack of machine, data, and human resources to build them. We compare both the performance and the cost of classical learning algorithms to the latest ones in common sequence and text labeling tasks. In our industrial datasets, we find that classical models often perform on par with deep neural ones despite the lower cost. We show the trade-off between performance gain and the cost across the models to give more insights for AI-pivoting business. Further, we call for more research into low-cost models, especially for under-resourced languages.