CL · Dec 19, 2022
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji et al. · nvidia
We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
CL · Sep 19, 2023 · Code
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
Samuel Cahyawijaya, Holy Lovenia, Fajri Koto et al.
Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resulting corpora, including a lack of lexical diversity and cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we present the NusaWrites benchmark, encompassing 12 underrepresented and extremely low-resource languages spoken by millions of individuals in Indonesia. Our empirical results using existing multilingual large language models point to the need to extend these models to more underrepresented languages. We release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes.
CL · May 30
Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations
Adril Putra Merin, David Anugraha, Ayu Purwarianti et al.
Recent advances in agentic AI have enabled agents to complete complex tasks through tool use, reasoning, and multi-step planning. Yet existing benchmarks evaluate agents within a single session, ignoring past actions, stated preferences, and prior decisions that agents must integrate to fulfill personalized user goals. We introduce Momento, a benchmark for persistent agentic task completion in multi-session service environments, requiring agents to take consequential, tool-mediated actions while resolving temporal dependencies and evolving user goals across sessions. Experimental results reveal that current agents fail primarily through misestimation of user state, treating prior session history as a reliable proxy for current context rather than stale information requiring re-validation, highlighting a substantial gap between current agent capabilities and realistic long-horizon human-agent interaction.
CL · Nov 21, 2023
IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages
Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata et al.
Significant progress has been made on Indonesian NLP. Nevertheless, exploration of the code-mixing phenomenon in Indonesian is limited, despite many languages being frequently mixed with Indonesian in daily conversation. In this work, we explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay; and introduce IndoRobusta, a framework to evaluate and improve the code-mixing robustness. Our analysis shows that the pre-training corpus bias affects the model's ability to better handle Indonesian-English code-mixing when compared to other local languages, despite having higher language diversity.
CL · Oct 12, 2023
QASiNa: Religious Domain Question Answering using Sirah Nabawiyah
Muhammad Razif Rizqullah, Ayu Purwarianti, Alham Fikri Aji
Question Answering (QA) tasks currently receive significant research focus, particularly with the development of Large Language Models (LLMs) such as ChatGPT [1]. LLMs can be applied to various domains, but applying them to the Islamic domain contradicts the principles of information transmission. Islam strictly regulates the sources of information and who may give interpretations, or tafseer, of those sources [2]. The way an LLM generates answers based on its own interpretation resembles tafseer, yet an LLM is neither an Islamic expert nor a human, so this is not permitted in Islam. Indonesia is the country with the largest Muslim population in the world [3]. Given the high influence of LLMs, we need to evaluate them in the religious domain. Currently, only a few religious QA datasets are available, and none of them uses Sirah Nabawiyah, especially in the Indonesian language. In this paper, we propose the Question Answering Sirah Nabawiyah (QASiNa) dataset, a novel dataset compiled from Sirah Nabawiyah literature in the Indonesian language. We demonstrate our dataset using mBERT [4], XLM-R [5], and IndoBERT [6], fine-tuned on an Indonesian translation of SQuAD v2.0 [7]. The XLM-R model returned the best performance on QASiNa, with an EM of 61.20, an F1-score of 75.94, and a Substring Match of 70.00. We compare XLM-R's performance with ChatGPT-3.5 and GPT-4 [1]. Both ChatGPT versions returned lower EM and F1-score with higher Substring Match, and the gap between EM and Substring Match widens in GPT-4. The experiments indicate that ChatGPT tends to give excessive interpretations, as evidenced by its higher Substring Match scores compared to EM and F1-score, even after being provided with instructions and context. We conclude that ChatGPT is unsuitable for question answering in the religious domain, especially for Islam.
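The three metrics named in this abstract can be illustrated with a minimal sketch. The normalization rules and function names below are assumptions in the style of SQuAD evaluation, not the paper's actual scoring code, and the example strings are made up for illustration.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(pred, gold):
    """1.0 only when prediction and gold answer are identical after normalization."""
    return float(normalize(pred) == normalize(gold))


def substring_match(pred, gold):
    """1.0 when the gold answer appears anywhere inside the prediction."""
    return float(normalize(gold) in normalize(pred))


def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: a verbose prediction that contains the gold span.
gold = "tahun 1945"
pred = "Indonesia merdeka pada tahun 1945"
em = exact_match(pred, gold)
sub = substring_match(pred, gold)
f1 = token_f1(pred, gold)
```

A verbose answer like the one above scores zero on EM, a reduced F1, and full Substring Match, which is the pattern the abstract reports for ChatGPT.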
CL · Jul 21, 2022
NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages
Samuel Cahyawijaya, Alham Fikri Aji, Holy Lovenia et al.
At the center of the underlying issues that halt Indonesian natural language processing (NLP) research advancement, we find data scarcity. Resources in Indonesian languages, especially the local ones, are extremely scarce and underrepresented. Many Indonesian researchers do not publish their datasets. Furthermore, the few public datasets that we have are scattered across different platforms, which makes performing reproducible and data-centric research in Indonesian NLP even more arduous. Rising to this challenge, we initiate the first Indonesian NLP crowdsourcing effort, NusaCrowd. NusaCrowd strives to provide the largest datasheets aggregation with standardized data loading for NLP tasks in all Indonesian languages. By enabling open and centralized access to Indonesian NLP resources, we hope NusaCrowd can tackle the data scarcity problem hindering NLP progress in Indonesia and encourage NLP practitioners to move towards collaboration.
SD · Jun 1, 2022
Speech Artifact Removal from EEG Recordings of Spoken Word Production with Tensor Decomposition
Holy Lovenia, Hiroki Tanaka, Sakriani Sakti et al.
Research about brain activities involving spoken word production is considerably underdeveloped because of the undiscovered characteristics of speech artifacts, which contaminate electroencephalogram (EEG) signals and prevent the inspection of the underlying cognitive processes. To fuel further EEG research with speech production, a method using three-mode tensor decomposition (time x space x frequency) is proposed to perform speech artifact removal. Tensor decomposition enables simultaneous inspection of multiple modes, which suits the multi-way nature of EEG data. In a picture-naming task, we collected raw data with speech artifacts by placing two electrodes near the mouth to record lip EMG. Based on our evaluation, which calculated the correlation values between grand-averaged speech artifacts and the lip EMG, tensor decomposition outperformed the former methods that were based on independent component analysis (ICA) and blind source separation (BSS), both in detecting speech artifact (0.985) and producing clean data (0.101). Our proposed method correctly preserved the components unrelated to speech, which was validated by computing the correlation value between the grand-averaged raw data without EOG and cleaned data before the speech onset (0.92-0.94).
CL · Oct 12, 2023
Low-Resource Clickbait Spoiling for Indonesian via Question Answering
Ni Putu Intan Maharani, Ayu Purwarianti, Alham Fikri Aji
Clickbait spoiling aims to generate a short text to satisfy the curiosity induced by a clickbait post. As it is a newly introduced task, the dataset is only available in English so far. Our contributions include the construction of a manually labeled clickbait spoiling corpus in Indonesian and an evaluation of cross-lingual zero-shot question answering-based models for clickbait spoiling in a low-resource language like Indonesian. We utilize a selection of multilingual language models. The experimental results suggest that the XLM-RoBERTa (large) model outperforms other models for phrase and passage spoilers, while the mDeBERTa (base) model outperforms other models for multipart spoilers.
CL · Sep 23, 2024
Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models
Patrick Amadeus Irawan, Genta Indra Winata, Samuel Cahyawijaya et al.
Natural Language Explanation (NLE) aims to elucidate the decision-making process by providing detailed, human-friendly explanations in natural language. It helps demystify the decision-making processes of large vision-language models (LVLMs) through the use of language models. While existing methods for creating Vision Question-Answering with Natural Language Explanation (VQA-NLE) datasets can provide explanations, they rely heavily on human annotations that are time-consuming and costly. In this study, we propose a novel approach that leverages LVLMs to efficiently generate high-quality synthetic VQA-NLE datasets. By evaluating our synthetic data, we showcase how advanced prompting techniques can lead to the production of high-quality VQA-NLE data. Our findings indicate that the proposed method is up to 20x faster than human annotation, with only a minimal decrease in qualitative metrics, achieving robust quality that is nearly equivalent to human-annotated data. Furthermore, we show that incorporating visual prompts significantly enhances the relevance of text generation. Our study paves the way for a more efficient and robust automated generation of multi-modal NLE data, offering a promising solution to the problem.
CL · Nov 2, 2023
IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems
Muhammad Dehan Al Kautsar, Rahmah Khoirussyifa' Nurdini, Samuel Cahyawijaya et al.
Task-oriented dialogue (ToD) systems have been mostly created for high-resource languages, such as English and Chinese. However, there is a need to develop ToD systems for other regional or local languages to broaden their ability to comprehend the dialogue contexts in various languages. This paper introduces IndoToD, an end-to-end multi domain ToD benchmark in Indonesian. We extend two English ToD datasets to Indonesian, comprising four different domains by delexicalization to efficiently reduce the size of annotations. To ensure a high-quality data collection, we hire native speakers to manually translate the dialogues. Along with the original English datasets, these new Indonesian datasets serve as an effective benchmark for evaluating Indonesian and English ToD systems as well as exploring the potential benefits of cross-lingual and bilingual transfer learning approaches.
CL · Nov 21, 2023
The Obscure Limitation of Modular Multilingual Language Models
Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Ayu Purwarianti
We expose the limitation of modular multilingual language models (MLMs) in multilingual inference scenarios with unknown languages. Existing evaluations of modular MLMs exclude the involvement of language identification (LID) modules, which obscures the performance of modular MLMs in real-case multilingual scenarios. In this work, we showcase the effect of adding LID on the multilingual evaluation of modular MLMs and provide discussions for closing the performance gap caused by the pipelined approach of LID and modular MLMs.
CL · Nov 2, 2023
Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia
Lucky Susanto, Ryandito Diandaru, Adila Krisnadhi et al.
Neural machine translation (NMT) for low-resource local languages in Indonesia faces significant challenges, including the need for a representative benchmark and limited data availability. This work addresses these challenges by comprehensively analyzing training NMT systems for four low-resource local languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese. Our study encompasses various training approaches, paradigms, data sizes, and a preliminary study into using large language models for synthetic low-resource languages parallel data generation. We reveal specific trends and insights into practical strategies for low-resource language translation. Our research demonstrates that despite limited computational resources and textual data, several of our NMT systems achieve competitive performances, rivaling the translation quality of zero-shot gpt-3.5-turbo. These findings significantly advance NMT for low-resource languages, offering valuable guidance for researchers in similar contexts.
CL · Jan 13
Mechanisms are Transferable: Data-Efficient Low-Resource Adaptation via Circuit-Targeted Supervised Fine-Tuning
Khumaisa Nur'aini, Ayu Purwarianti, Alham Fikri Aji et al.
Adapting LLMs to low-resource languages is difficult: labeled data is scarce, full-model fine-tuning is unstable, and continued cross-lingual tuning can cause catastrophic forgetting. We propose Circuit-Targeted Supervised Fine-Tuning (CT-SFT): a counterfactual-free adaptation of CD-T (Contextual Decomposition Transformer) that uses a label-balanced mean baseline and task-directional relevance scoring to identify a sparse set of task-relevant attention heads in a proxy-language checkpoint, then transfer learns to a target language by updating only those heads (plus LayerNorm) via head-level gradient masking. Across NusaX-Senti and XNLI, CT-SFT improves cross-lingual accuracy over continued full fine-tuning while updating only a small subset of model parameters. We find an editing-preserving trade-off: harder transfers favor editing circuit heads, while easier transfers often favor near-zero (i.e., low-relevance heads) updates, preserving the source mechanism. CT-SFT also substantially reduces catastrophic forgetting, preserving proxy/source-language competence during transfer.
CL · Oct 15, 2023
Domain-Specific Language Model Post-Training for Indonesian Financial NLP
Ni Putu Intan Maharani, Yoga Yustiawan, Fauzy Caesar Rochim et al.
BERT and IndoBERT have achieved impressive performance in several NLP tasks. There have been several investigations into their adaptation to specialized domains, especially for the English language. We focus on the financial domain and the Indonesian language, where we perform post-training on pre-trained IndoBERT for the financial domain using a small-scale Indonesian financial corpus. In this paper, we construct an Indonesian self-supervised financial corpus, an Indonesian financial sentiment analysis dataset, and an Indonesian financial topic classification dataset, and release a family of BERT models for financial NLP. We also evaluate the effectiveness of domain-specific post-training on sentiment analysis and topic classification tasks. Our findings indicate that post-training increases the effectiveness of a language model when it is fine-tuned on domain-specific downstream tasks.
CL · Nov 3, 2023
Indo LEGO-ABSA: A Multitask Generative Aspect Based Sentiment Analysis for Indonesian Language
Randy Zakya Suchrady, Ayu Purwarianti
Aspect-based sentiment analysis is a method in natural language processing aimed at identifying and understanding sentiments related to specific aspects of an entity. Aspects are words or phrases that represent an aspect or attribute of a particular entity. Previous research has utilized generative pre-trained language models to perform aspect-based sentiment analysis. LEGO-ABSA is one framework that has successfully employed generative pre-trained language models in aspect-based sentiment analysis, particularly in English. LEGO-ABSA uses a multitask learning and prompting approach to enhance model performance. However, the application of this approach has not been done in the context of Bahasa Indonesia. Therefore, this research aims to implement the multitask learning and prompting approach in aspect-based sentiment analysis for Bahasa Indonesia using generative pre-trained language models. In this study, the Indo LEGO-ABSA model is developed, which is an aspect-based sentiment analysis model utilizing generative pre-trained language models and trained with multitask learning and prompting. Indo LEGO-ABSA is trained with a hotel domain dataset in the Indonesian language. The obtained results include an f1-score of 79.55% for the Aspect Sentiment Triplet Extraction task, 86.09% for Unified Aspect-based Sentiment Analysis, 79.85% for Aspect Opinion Pair Extraction, 87.45% for Aspect Term Extraction, and 88.09% for Opinion Term Extraction. Indo LEGO-ABSA adopts the LEGO-ABSA framework that employs the T5 model, specifically mT5, by applying multitask learning to train all tasks within aspect-based sentiment analysis.
LG · Jun 2, 2025 · Code
Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability
Genta Indra Winata, David Anugraha, Emmy Liu et al. · amazon-science
High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics.
CL · Aug 22, 2024
Enhancing Natural Language Inference Performance with Knowledge Graph for COVID-19 Automated Fact-Checking in Indonesian Language
Arief Purnama Muharram, Ayu Purwarianti
Automated fact-checking is a key strategy to overcome the spread of COVID-19 misinformation on the internet. These systems typically leverage deep learning approaches through Natural Language Inference (NLI) to verify the truthfulness of information based on supporting evidence. However, one challenge that arises in deep learning is performance stagnation due to a lack of knowledge during training. This study proposes using a Knowledge Graph (KG) as external knowledge to enhance NLI performance for automated COVID-19 fact-checking in the Indonesian language. The proposed model architecture comprises three modules: a fact module, an NLI module, and a classifier module. The fact module processes information from the KG, while the NLI module handles semantic relationships between the given premise and hypothesis. The representation vectors from both modules are concatenated and fed into the classifier module to produce the final result. The model was trained using the generated Indonesian COVID-19 fact-checking dataset and the COVID-19 KG Bahasa Indonesia. Our study demonstrates that incorporating KGs can significantly improve NLI performance in fact-checking, achieving the best accuracy of 0.8616. This suggests that KGs are a valuable component for enhancing NLI performance in automated fact-checking.
LG · Jun 13, 2024 · Code
MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti et al.
Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can lead to major memory bottlenecks as model size, batch size, and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach extending KV sharing across transformer layers to reduce memory usage beyond what was possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, reducing KV cache size by a factor of up to 6x compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at https://github.com/zaydzuhri/pythia-mlkv
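The memory argument can be made concrete with a rough back-of-the-envelope sketch. The model shape (12 layers, 12 heads, head dimension 64) is an assumed Pythia-160M-like configuration, and the choice of 2 KV-holding layers for MLKV is a hypothetical setting picked because it reproduces a 6x ratio against MQA; neither is taken from the paper's code.

```python
def kv_cache_bytes(kv_layers, kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """Size of the KV cache; the leading 2 covers keys and values."""
    return 2 * kv_layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Assumed model shape (not read from the released checkpoints).
LAYERS, HEADS, HEAD_DIM, SEQ_LEN = 12, 12, 64, 2048

mha = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, 4, HEAD_DIM, SEQ_LEN)      # 4 KV head groups per layer
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)      # 1 KV head per layer
mlkv = kv_cache_bytes(2, 1, HEAD_DIM, SEQ_LEN)          # hypothetical: only 2 layers hold a KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa), ("MLKV", mlkv)]:
    print(f"{name}: {size / 2**20:.2f} MiB")
```

MQA and GQA can only shrink the per-layer head count, so their floor is one KV head per layer; sharing across layers lowers the layer count as well, which is where the extra reduction comes from.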
CL · Oct 16, 2024
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan et al.
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
CL · Apr 9, 2024
Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages
Samuel Cahyawijaya, Holy Lovenia, Fajri Koto et al.
Large language models (LLMs) show remarkable human-like capability in various domains and languages. However, a notable quality gap arises in low-resource languages, e.g., Indonesian indigenous languages, rendering them ineffective and inefficient in such linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes. We highlight Cendol's effectiveness across a diverse array of tasks, attaining 20% improvement, and demonstrate its capability to generalize to unseen tasks and indigenous languages of Indonesia. Furthermore, Cendol models showcase improved human favorability despite their limitations in capturing indigenous knowledge and cultural values in Indonesia. In addition, we discuss the shortcomings of parameter-efficient tunings, such as LoRA, for language adaptation. Alternatively, we propose the usage of vocabulary adaptation to enhance efficiency. Lastly, we evaluate the safety of Cendol and showcase that safety in pre-training in one language such as English is transferable to low-resource languages, such as Indonesian, even without RLHF and safety fine-tuning.
CL · Jan 11, 2024
LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization
Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Alham Fikri Aji et al.
Pretrained language models (PLMs) have become remarkably adept at task and language generalization. Nonetheless, they often fail when faced with unseen languages. In this work, we present LinguAlchemy, a regularization method that incorporates various linguistic information covering typological, geographical, and phylogenetic features to align PLM representations with the corresponding linguistic information for each language. LinguAlchemy significantly improves the performance of mBERT and XLM-R on low-resource languages in multiple downstream tasks such as intent classification, news classification, and semantic relatedness compared to fully finetuned models, while displaying a high degree of unseen language generalization. We further introduce AlchemyScale and AlchemyTune, extensions of LinguAlchemy that adjust the linguistic regularization weights automatically, alleviating the need for hyperparameter search.
CL · Feb 21, 2024
Could We Have Had Better Multilingual LLMs If English Was Not the Central Language?
Ryandito Diandaru, Lucky Susanto, Zilu Tang et al.
Large Language Models (LLMs) demonstrate strong machine translation capabilities on languages they are trained on. However, the impact of factors beyond training data size on translation performance remains a topic of debate, especially concerning languages not directly encountered during training. Our study delves into Llama2's translation capabilities. By modeling a linear relationship between linguistic feature distances and machine translation scores, we ask ourselves if there are potentially better central languages for LLMs other than English. Our experiments show that the 7B Llama2 model yields above 10 BLEU when translating into all languages it has seen, which rarely happens for languages it has not seen. Most translation improvements into unseen languages come from scaling up the model size rather than instruction tuning or increasing shot count. Furthermore, our correlation analysis reveals that syntactic similarity is not the only linguistic factor that strongly correlates with machine translation scores. Interestingly, we discovered that under specific circumstances, some languages (e.g. Swedish, Catalan), despite having significantly less training data, exhibit comparable correlation levels to English. These insights challenge the prevailing landscape of LLMs, suggesting that models centered around languages other than English could provide a more efficient foundation for multilingual applications.
CL · Jul 29, 2025
IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian
Vanessa Rebecca Wiyono, David Anugraha, Ayu Purwarianti et al.
Over 200 million people speak Indonesian, yet the language remains significantly underrepresented in preference-based research for large language models (LLMs). Most existing multilingual datasets are derived from English translations, often resulting in content that lacks cultural and linguistic authenticity. To address this gap, we introduce IndoPref, the first fully human-authored and multi-domain Indonesian preference dataset designed to evaluate the naturalness and quality of LLM-generated text. The dataset contains 522 prompts and yields 4,099 human-annotated pairwise preferences from comparisons across five instruction-tuned LLMs. All annotations are natively written in Indonesian with strong inter-annotator agreement, measured by Krippendorff's alpha. Our benchmark spans 10 diverse categories, enabling practitioners to identify LLMs' fine-grained strengths and weaknesses.
CL · Nov 27, 2024
Continual Learning in Machine Speech Chain Using Gradient Episodic Memory
Geoffrey Tyndall, Kurniawati Azizah, Dipta Tanaya et al.
Continual learning for automatic speech recognition (ASR) systems poses a challenge, especially with the need to avoid catastrophic forgetting while maintaining performance on previously learned tasks. This paper introduces a novel approach leveraging the machine speech chain framework to enable continual learning in ASR using gradient episodic memory (GEM). By incorporating a text-to-speech (TTS) component within the machine speech chain, we support the replay mechanism essential for GEM, allowing the ASR model to learn new tasks sequentially without significant performance degradation on earlier tasks. Our experiments, conducted on the LJ Speech dataset, demonstrate that our method outperforms traditional fine-tuning and multitask learning approaches, achieving a substantial error rate reduction while maintaining high performance across varying noise conditions. We showed the potential of our semi-supervised machine speech chain approach for effective and efficient continual learning in speech recognition.
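The core idea of gradient episodic memory can be sketched for the simplest case of a single earlier task. With one memory constraint, GEM's quadratic program has the closed-form projection below; the general multi-task case needs a QP solver. The function name and the plain-list gradients are illustrative assumptions, and in the setup described above the reference gradient would be computed on replayed samples (here supplied by the TTS component) rather than passed in by hand.

```python
def project_gradient(grad, ref_grad):
    """Return grad unchanged if it does not conflict with ref_grad,
    otherwise remove the conflicting component.

    grad:     gradient of the loss on the new task
    ref_grad: gradient of the loss on the episodic memory of the old task
    """
    dot = sum(g * r for g, r in zip(grad, ref_grad))
    if dot >= 0:
        # The update is not expected to increase the old-task loss.
        return list(grad)
    ref_sq = sum(r * r for r in ref_grad)  # > 0 whenever dot < 0
    return [g - (dot / ref_sq) * r for g, r in zip(grad, ref_grad)]


# A conflicting update: its second component points against the memory gradient.
conflicting = project_gradient([1.0, -1.0], [0.0, 1.0])
# A compatible update is left untouched.
compatible = project_gradient([1.0, 1.0], [0.0, 1.0])
```

After projection, the inner product between the update and the memory gradient is zero rather than negative, which is the condition GEM uses to avoid increasing the loss on earlier tasks to first order.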
CV · Dec 13, 2025
Moment and Highlight Detection via MLLM Frame Segmentation
I Putu Andika Bagas Jiwanta, Ayu Purwarianti
The detection of video moments and highlights from natural-language queries has been unified by transformer-based methods. Other works use a generative Multimodal LLM (MLLM) to predict moments and/or highlights as text timestamps, utilizing its reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Although recent Reinforcement Learning (RL) methods attempt to address the issue, we propose a novel approach by applying segmentation objectives directly on the LLM's output tokens. The LLM is fed a fixed number of frames alongside a prompt that forces it to output a sequence of continuous "0" and/or "1" characters, with one character per frame. The "0"/"1" characters benefit from the LLM's inherent language capability while also acting as background and foreground probabilities, respectively. Training employs segmentation losses on the probabilities alongside a normal causal LM loss. At inference, beam search generates a sequence and logits, acting as moments and saliency scores, respectively. Despite sampling only 25 frames -- less than half of comparable methods -- our method achieved strong highlight detection (56.74 HIT@1) on QVHighlights. Additionally, our efficient method scores above the baseline (35.28 MAP) for moment retrieval. Empirically, segmentation losses provide a stable complementary learning signal even when the causal LM loss plateaus.
CL · Apr 29, 2025
What Causes Knowledge Loss in Multilingual Language Models?
Maria Khelli, Samuel Cahyawijaya, Ayu Purwarianti et al.
Cross-lingual transfer in natural language processing (NLP) models enhances multilingual performance by leveraging shared linguistic knowledge. However, traditional methods that process all data simultaneously often fail to mimic real-world scenarios, leading to challenges like catastrophic forgetting, where fine-tuning on new tasks degrades performance on previously learned ones. Our study explores this issue in multilingual contexts, focusing on linguistic differences affecting representational learning rather than just model parameters. We experiment with 52 languages using LoRA adapters of varying ranks to evaluate non-shared, partially shared, and fully shared parameters. Our aim is to see if parameter sharing through adapters can mitigate forgetting while preserving prior knowledge. We find that languages using non-Latin scripts are more susceptible to catastrophic forgetting, whereas those written in Latin script facilitate more effective cross-lingual transfer.
LGFeb 3, 2025
QLESS: A Quantized Approach for Data Valuation and Selection in Large Language Model Fine-TuningMoses Ananta, Muhammad Farid Adilazuarda, Zayd Muhammad Kawakibi Zuhri et al.
Fine-tuning large language models (LLMs) is often constrained by the computational costs of processing massive datasets. We propose QLESS (Quantized Low-rank Gradient Similarity Search), which integrates gradient quantization with the LESS framework to enable memory-efficient data valuation and selection. QLESS employs a two-step compression process: first, it obtains low-dimensional gradient representations through LoRA-based random projection; then, it quantizes these gradients to low-bitwidth representations. Experiments on multiple LLM architectures (LLaMA, Mistral, Qwen) and benchmarks (MMLU, BBH, TyDiQA) show that QLESS achieves comparable data selection performance to LESS while reducing memory usage by up to 16x. Even 1-bit gradient quantization preserves data valuation quality. These findings underscore QLESS as a practical, scalable approach to identifying informative examples within strict memory constraints.
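The two-step compression described in this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the QLESS implementation: the projection here is a plain random-sign (Rademacher) projection of a raw vector rather than a projection of LoRA gradients, 1-bit quantization is modeled as taking signs, and the helper names `random_projection`, `quantize_1bit`, and `cosine` are hypothetical.

```python
import math
import random


def random_projection(grad, out_dim, seed=0):
    """Project `grad` to `out_dim` dimensions with random +/-1 weights.

    The same seed must be used for every gradient so that all vectors
    share one projection matrix.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [
        scale * sum(rng.choice((-1.0, 1.0)) * g for g in grad)
        for _ in range(out_dim)
    ]


def quantize_1bit(vec):
    """Keep only the sign of each component (one bit per dimension)."""
    return [1 if v >= 0 else -1 for v in vec]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


train_grad = [0.5, -1.0, 2.0, 0.1]
valid_grad = [0.4, -0.9, 1.8, 0.2]
q_train = quantize_1bit(random_projection(train_grad, 8))
q_valid = quantize_1bit(random_projection(valid_grad, 8))
print(cosine(q_train, q_valid))  # similarity score used to rank examples
```

Ranking training examples by such a similarity score against validation gradients is the general idea of gradient-similarity data selection; the scoring details here are simplified.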
CLNov 14, 2024
DriveThru: a Document Extraction Platform and Benchmark Datasets for Indonesian Local Language ArchivesMohammad Rifqi Farhansyah, Muhammad Zuhdi Fikri Johari, Afinzaki Amiral et al.
Indonesia is one of the most linguistically diverse countries. However, despite this diversity, Indonesian languages remain underrepresented in Natural Language Processing (NLP) research and technologies. In the past two years, several efforts have been made to construct NLP resources for Indonesian languages. However, most of these efforts have focused on creating resources manually and are thus difficult to scale to more languages. Although many Indonesian languages do not have a web presence, locally there are resources that document these languages well in printed forms such as books, magazines, and newspapers. Digitizing these existing resources will enable scaling of Indonesian language resource construction to many more languages. In this paper, we propose an alternative method of creating datasets by digitizing documents, which have not previously been used to build digital language resources in Indonesia. DriveThru is a platform for extracting document content that uses Optical Character Recognition (OCR) techniques to build language resources with less manual effort and cost. This paper also studies the utility of a current state-of-the-art LLM for post-OCR correction, showing that it can increase the character accuracy rate (CAR) and word accuracy rate (WAR) compared to off-the-shelf OCR.
CLOct 11, 2024
Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech VariabilitiesAulia Adila, Dessi Lestari, Ayu Purwarianti et al.
An ideal speech recognition model has the capability to transcribe speech accurately under various characteristics of speech signals, such as speaking style (read and spontaneous), speech context (formal and informal), and background noise conditions (clean and moderate). Building such a model requires a significant amount of training data with diverse speech characteristics. Currently, Indonesian data is dominated by read, formal, and clean speech, leading to a scarcity of Indonesian data with other speech variabilities. To develop Indonesian automatic speech recognition (ASR), we present our research on state-of-the-art speech recognition models, namely Massively Multilingual Speech (MMS) and Whisper, as well as compiling a dataset comprising Indonesian speech with variabilities to facilitate our study. We further investigate the models' predictive ability to transcribe Indonesian speech data across different variability groups. The best results were achieved by the Whisper fine-tuned model across datasets with various characteristics, as indicated by the decrease in word error rate (WER) and character error rate (CER). Moreover, we found that speaking style variability affected model performance the most.
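For readers unfamiliar with the two metrics named in this abstract, the sketch below computes WER and CER in their usual form: Levenshtein edit distance between reference and hypothesis, divided by the reference length. Whitespace tokenization for words and counting spaces as characters are assumptions made here; the paper's exact text normalization is not stated in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("saya pergi ke pasar", "saya pergi pasar"))  # one deleted word of four
print(cer("pasar", "pasir"))                           # one substituted character of five
```

Lower values are better for both metrics, and either can exceed 1.0 when the hypothesis contains many insertions.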
CLJun 14, 2024
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian LanguagesHoly Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar et al.
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.
CLJan 21, 2022
A Comparative Study on Language Models for Task-Oriented Dialogue SystemsVinsen Marselino Andreas, Genta Indra Winata, Ayu Purwarianti
The recent development of language models has shown promising results by achieving state-of-the-art performance on various natural language tasks by fine-tuning pretrained models. In task-oriented dialogue (ToD) systems, language models can be used for end-to-end training without relying on dialogue state tracking to track the dialogue history but allowing the language models to generate responses according to the context given as input. This paper conducts a comparative study to show the effectiveness and strength of using recent pretrained models for fine-tuning, such as BART and T5, on end-to-end ToD systems. The experimental results show substantial performance improvements after language model fine-tuning. The models produce more fluent responses after adding knowledge to the context that guides the model to avoid hallucination and generate accurate entities in the generated responses. Furthermore, we found that BART and T5 outperform GPT-based models in BLEU and F1 scores and achieve state-of-the-art performance in a ToD system.
CLApr 16, 2021
IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language GenerationSamuel Cahyawijaya, Genta Indra Winata, Bryan Wilie et al.
Natural language generation (NLG) benchmarks provide an important avenue to measure progress and develop better NLG systems. Unfortunately, the lack of publicly available NLG benchmarks for low-resource languages poses a challenging barrier for building NLG systems that work well for languages with limited amounts of data. Here we introduce IndoNLG, the first benchmark to measure natural language generation (NLG) progress in three low-resource -- yet widely spoken -- languages of Indonesia: Indonesian, Javanese, and Sundanese. Altogether, these languages are spoken by more than 100 million native speakers, and hence constitute an important use case of NLG systems today. Concretely, IndoNLG covers six tasks: summarization, question answering, chit-chat, and three different pairs of machine translation (MT) tasks. We collate a clean pretraining corpus of Indonesian, Sundanese, and Javanese datasets, Indo4B-Plus, which is used to pretrain our models: IndoBART and IndoGPT. We show that IndoBART and IndoGPT achieve competitive performance on all tasks -- despite using only one-fifth the parameters of a larger multilingual model, mBART-LARGE (Liu et al., 2020). This finding emphasizes the importance of pretraining on closely related, local languages to achieve more efficient learning and faster inference for very low-resource languages like Javanese and Sundanese.
CLSep 29, 2020
Sequence-to-Sequence Learning for Indonesian Automatic Question GeneratorFerdiant Joshua Muis, Ayu Purwarianti
Automatic question generation is defined as the task of automating the creation of questions given various textual data. Research on automatic question generators (AQG) has been conducted for more than 10 years, mainly focused on factoid questions. In all these studies, the state of the art is attained using a sequence-to-sequence approach. However, AQG systems for Indonesian have not been researched intensively. In this work we construct an Indonesian automatic question generator, adapting the architecture from some previous works. In summary, we used a sequence-to-sequence approach using BiGRU, BiLSTM, and Transformer with additional linguistic features, copy mechanism, and coverage mechanism. Since there is no large and popular public Indonesian dataset for question generation, we translated the SQuAD v2.0 factoid question answering dataset, with the additional Indonesian TyDiQA dev set for testing. The system achieved BLEU1, BLEU2, BLEU3, BLEU4, and ROUGE-L scores of 38.35, 20.96, 10.68, 5.78, and 43.4 for SQuAD, and 39.9, 20.78, 10.26, 6.31, and 44.13 for TyDiQA, respectively. The system performed well when the expected answers are named entities and are syntactically close to the context explaining them. Additionally, from a native Indonesian perspective, the best questions generated by our best models on their best cases are acceptable and reasonably useful.
CLSep 15, 2020
Improving Joint Layer RNN based Keyphrase Extraction by Using Syntactical FeaturesMiftahul Mahfuzh, Sidik Soleman, Ayu Purwarianti
Keyphrase extraction, the task of identifying important words or phrases in a text, is a crucial process for identifying main topics when analyzing texts from a social media platform. In our study, we focus on text written in the Indonesian language taken from Twitter. Unlike the original joint layer recurrent neural network (JRNN), which outputs one sequence of keywords and uses only word embeddings, we propose to modify the input layer of JRNN to extract more than one sequence of keywords by adding syntactical features, namely part of speech, named entity types, and dependency structures. Since JRNN in general requires a large amount of data as training examples and creating those examples is expensive, we used a data augmentation method to increase the number of training examples. Our experiments show that our method outperformed the baseline methods. Our method achieved .9597 in accuracy and .7691 in F1.
CLSep 13, 2020
Combining Word and Character Vector Representation on Neural Machine TranslationK. M. Shahih, Ayu Purwarianti
This paper describes combinations of word vector representation and character vector representation in English-Indonesian neural machine translation (NMT). Six configurations of NMT models were built with different input vector representations: word-based, combination of word and character representation using bidirectional LSTM (bi-LSTM), combination of word and character representation using CNN, and combination of word and character representation by combining bi-LSTM and CNN with three different vector operations: addition, pointwise multiplication, and averaging. The experiment results showed that NMT models with concatenation of word and character representation obtained BLEU scores higher than the baseline model, by 9.14 to 11.65 points, for all models that combine both word and character representation, except the model that combines word and character representation using both bi-LSTM and CNN by the addition operation. The highest BLEU score achieved was 42.48, compared to 30.83 for the baseline model.
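The three vector operations named in this abstract are simple element-wise combinations. The sketch below shows them on plain Python lists; the function name `combine` and the argument names `lstm_vec` and `cnn_vec` are hypothetical, and the assumption that the two character-level vectors have equal length is made here for illustration rather than taken from the paper.

```python
def combine(lstm_vec, cnn_vec, op):
    """Combine two equal-length vectors element-wise.

    `op` is one of "add", "multiply" (pointwise), or "average".
    """
    if len(lstm_vec) != len(cnn_vec):
        raise ValueError("vectors must have the same length")
    pairs = list(zip(lstm_vec, cnn_vec))
    if op == "add":
        return [a + b for a, b in pairs]
    if op == "multiply":
        return [a * b for a, b in pairs]
    if op == "average":
        return [(a + b) / 2 for a, b in pairs]
    raise ValueError(f"unknown operation: {op}")


lstm_vec = [1.0, 2.0]
cnn_vec = [3.0, 4.0]
for op in ("add", "multiply", "average"):
    print(op, combine(lstm_vec, cnn_vec, op))
```

All three operations keep the output dimension equal to the input dimension, unlike concatenation, which doubles it.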
CLSep 12, 2020
Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph VectorAyu Purwarianti, Ida Ayu Putu Ari Crisdayanti
Bidirectional Long Short-Term Memory Network (Bi-LSTM) has shown promising performance in the sentiment classification task. It processes inputs as a sequence of information. Due to this behavior, sentiment predictions by Bi-LSTM are influenced by word order, and the first or last phrases of a text tend to have stronger features than other phrases. Meanwhile, in Indonesian sentiment analysis, phrases that express the sentiment of a document might not appear in the first or last part of the document, which can lead to incorrect sentiment classification. To this end, we propose using an existing document representation method called paragraph vector as additional input features for Bi-LSTM. This vector provides the context of the document at each step of sequence processing. The paragraph vector is simply concatenated to each word vector of the document. This representation also helps to differentiate ambiguous Indonesian words. Bi-LSTM and paragraph vector were previously used as separate methods. Combining the two methods has shown a significant performance improvement for the Indonesian sentiment analysis model. Several case studies on testing data showed that the proposed method can handle the sentiment phrase position problem encountered by Bi-LSTM.
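The input construction described in this abstract, where one document-level vector is appended to every word vector, can be sketched as follows. The function name `augment_with_paragraph_vector` and the toy dimensions are hypothetical; how the paragraph vector itself is trained is outside this sketch.

```python
def augment_with_paragraph_vector(word_vectors, paragraph_vector):
    """Append the same paragraph vector to every word vector.

    Returns a new list; each output vector has length
    len(word_vector) + len(paragraph_vector).
    """
    return [list(w) + list(paragraph_vector) for w in word_vectors]


# Toy document of two words with 2-d word vectors and a 1-d paragraph vector.
word_vectors = [[0.1, 0.2], [0.3, 0.4]]
paragraph_vector = [9.0]
print(augment_with_paragraph_vector(word_vectors, paragraph_vector))
```

Because every time step sees the same document-level features, the sequence model receives document context regardless of where the sentiment-bearing phrase appears.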
CLSep 12, 2020
Improving Indonesian Text Classification Using Multilingual Language ModelIlham Firdausi Putra, Ayu Purwarianti
Compared to English, the amount of labeled data for Indonesian text classification tasks is very small. Recently developed multilingual language models have shown their ability to create multilingual representations effectively. This paper investigates the effect of combining English and Indonesian data on building Indonesian text classification models (e.g., for sentiment analysis and hate speech) using multilingual language models. Using the feature-based approach, we observe its performance across various data sizes and amounts of added English data. The experiment showed that the addition of English data, especially if the amount of Indonesian data is small, improves performance. Using the fine-tuning approach, we further showed its effectiveness in utilizing the English language to build Indonesian text classification models.
CLSep 12, 2020
Relation Detection for Indonesian Language using Deep Neural Network -- Support Vector MachineRamos Janoah Hasudungan, Ayu Purwarianti
Relation detection is the task of determining whether two entities are related or not. In this paper, we employ a neural network to perform relation detection between two named entities for the Indonesian language. We used features such as word embedding, position embedding, POS-tag embedding, and character embedding. We divide the model into two parts: a front-part classifier (convolutional layer or LSTM layer) and a back-part classifier (dense layer or SVM). We performed a grid search over the hyperparameters of the neural network and the SVM. We used 6,000 Indonesian sentences for training and 1,125 for testing. The best result is an F1-score of 0.8083, using a convolutional layer as the front part and an SVM as the back part.
CLSep 11, 2020
Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity TaggerDevin Hoesen, Ayu Purwarianti
Research on Indonesian named entity (NE) taggers has been conducted for years. However, most studies did not use deep learning and instead employed traditional machine learning algorithms such as association rules, support vector machines, random forests, naïve Bayes, etc. In those studies, word lists such as gazetteers or clue words were provided to enhance the accuracy. Here, we attempt to employ deep learning in our Indonesian NE tagger. We use long short-term memory (LSTM) as the topology since it is the state of the art for NE tagging. By using LSTM, we do not need a word list in order to enhance the accuracy. There are two main things that we investigate. The first is the output layer of the network: Softmax vs conditional random field (CRF). The second is the use of a part-of-speech (POS) tag embedding input layer. Using 8,400 sentences as the training data and 97 sentences as the evaluation data, we find that using POS tag embedding as additional input improves the performance of our Indonesian NE tagger. As for the comparison between Softmax and CRF, we find that both architectures have a weakness in classifying an NE tag.
CLSep 11, 2020
Coreference Resolution System for Indonesian Text with Mention Pair Method and Singleton Exclusion using Convolutional Neural NetworkTurfa Auliarachman, Ayu Purwarianti
Neural networks have shown promising performance in coreference resolution systems that use the mention pair method. A deep neural network can learn hidden and deep relations between two mentions. However, there is no work on coreference resolution for Indonesian text that uses this learning technique. The state-of-the-art system for Indonesian text only states that the use of lexical and syntactic features can improve the existing coreference resolution system. In this paper, we propose a new coreference resolution system for Indonesian text with the mention pair method that uses a deep neural network to learn the relations between the two mentions. In addition to lexical and syntactic features, in order to learn the representation of the mentions' words and context, we use word embeddings and feed them to a Convolutional Neural Network (CNN). Furthermore, we perform singleton exclusion using a singleton classifier component to prevent singleton mentions from entering any entity clusters at the end. Achieving 67.37% without singleton exclusion, 63.27% with a trained singleton classifier, and 75.95% with a gold singleton classifier on CoNLL average F1 score, our proposed system outperforms the state-of-the-art system.
CLSep 11, 2020
IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language UnderstandingBryan Wilie, Karissa Vincentio, Genta Indra Winata et al.
Although Indonesian is known to be the fourth most frequently used language on the internet, research progress on this language in natural language processing (NLP) is slow-moving due to a lack of available resources. In response, we introduce the first-ever vast resource for training, evaluating, and benchmarking on Indonesian natural language understanding (IndoNLU) tasks. IndoNLU includes twelve tasks, ranging from single sentence classification to pair-sentences sequence labeling with different levels of complexity. The datasets for the tasks lie in different domains and styles to ensure task diversity. We also provide a set of Indonesian pre-trained models (IndoBERT) trained on a large and clean Indonesian dataset, Indo4B, collected from publicly available sources such as social media texts, blogs, news, and websites. We release baseline models for all twelve tasks, as well as the framework for benchmark evaluation, thus enabling everyone to benchmark their system performance.
CLMay 12, 2015
Indonesian Social Media Sentiment Analysis With Sarcasm DetectionEdwin Lunando, Ayu Purwarianti
Sarcasm is considered one of the most difficult problems in sentiment analysis. In our observation of Indonesian social media, for certain topics, people tend to criticize something using sarcasm. Here, we proposed two additional features to detect sarcasm after a common sentiment analysis is conducted. The features are the negativity information and the number of interjection words. We also employed translated SentiWordNet in the sentiment classification. All the classifications were conducted with machine learning algorithms. The experimental results showed that the additional features are quite effective in the sarcasm detection.