Charibeth Cheng

CL
h-index2
11papers
1,861citations
Novelty35%
AI Score37

11 Papers

CLApr 7, 2022
Towards Automatic Construction of Filipino WordNet: Word Sense Induction and Synset Induction Using Sentence Embeddings

Dan John Velasco, Axel Alba, Trisha Gail Pelagio et al.

Wordnets are indispensable tools for various natural language processing applications. Unfortunately, wordnets get outdated, and producing or updating wordnets can be slow and costly in terms of time and resources. This problem intensifies for low-resource languages. This study proposes a method for word sense induction and synset induction using only two linguistic resources, namely, an unlabeled corpus and a sentence embeddings-based language model. The resulting sense inventory and synonym sets can be used in automatically creating a wordnet. We applied this method on a corpus of Filipino text. The sense inventory and synsets were evaluated by matching them with the sense inventory of the machine translated Princeton WordNet, as well as comparing the synsets to the Filipino WordNet. This study empirically shows that the 30% of the induced word senses are valid and 40% of the induced synsets are valid in which 20% are novel synsets.

CLApr 6, 2022
Using Synthetic Data for Conversational Response Generation in Low-resource Settings

Gabriel Louis Tan, Adrian Paule Ty, Schuyler Ng et al.

Response generation is a task in natural language processing (NLP) where a model is trained to respond to human statements. Conversational response generators take this one step further with the ability to respond within the context of previous responses. While there are existing techniques for training such models, they all require an abundance of conversational data which are not always available for low-resource languages. In this research, we make three contributions. First, we released the first Filipino conversational dataset collected from a popular Philippine online forum, which we named the PEx Conversations Dataset. Second, we introduce a data augmentation (DA) methodology for Filipino data by employing a Tagalog RoBERTa model to increase the size of the existing corpora. Lastly, we published the first Filipino conversational response generator capable of generating responses related to the previous 3 responses. With the supplementary synthetic data, we were able to improve the performance of the response generator by up to 12.2% in BERTScore, 10.7% in perplexity, and 11.7% in content word usage as compared to training with zero synthetic data.

SEOct 2, 2021Code
The Need for a Fine-grained approach in Just-in-Time Defect Prediction

Giuseppe Ng, Charibeth Cheng

With software system complexity leading to the rise of software defects, research efforts have been done on techniques towards predicting software defects and Just-in-time (JIT) defect prediction which predicts whether a code change is defective. While using features to determine potentially defective code change, inspection effort is still significant. As code change can impact several files, we investigate an open source project to identify potential gaps with features in JIT perspective. In addition, with a lack of publicly available JIT dataset that link the features with actual commits, we also present a new dataset that can be utilized in JIT and semantic analysis.

SEOct 3, 2025
Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders

Kriz Tahimic, Charibeth Cheng

As Large Language Models become integral to software development, with substantial portions of AI-suggested code entering production, understanding their internal correctness mechanisms becomes critical for safe deployment. We apply sparse autoencoders to decompose LLM representations, identifying directions that correspond to code correctness. We select predictor directions using t-statistics and steering directions through separation scores from base model representations, then analyze their mechanistic properties through steering, attention analysis, and weight orthogonalization. We find that code correctness directions in LLMs reliably predict incorrect code, while correction capabilities, though statistically significant, involve tradeoffs between fixing errors and preserving correct code. Mechanistically, successful code generation depends on attending to test cases rather than problem descriptions. Moreover, directions identified in base models retain their effectiveness after instruction-tuning, suggesting code correctness mechanisms learned during pre-training are repurposed during fine-tuning. Our mechanistic insights suggest three practical applications: prompting strategies should prioritize test examples over elaborate problem descriptions, predictor directions can serve as error alarms for developer review, and these same predictors can guide selective steering, intervening only when errors are anticipated to prevent the code corruption from constant steering.

CLNov 11, 2021
Improving Large-scale Language Models and Resources for Filipino

Jan Christian Blaise Cruz, Charibeth Cheng

In this paper, we improve on existing language resources for the low-resource Filipino language in two ways. First, we outline the construction of the TLUnified dataset, a large-scale pretraining corpus that serves as an improvement over smaller existing pretraining datasets for the language in terms of scale and topic variety. Second, we pretrain new Transformer language models following the RoBERTa pretraining technique to supplant existing models trained with small corpora. Our new RoBERTa models show significant improvements over existing Filipino models in three benchmark datasets with an average gain of 4.47% test accuracy across the three classification tasks of varying difficulty.

SESep 25, 2021
Investigation of Dataset Features for Just-in-Time Defect Prediction

Giuseppe Ng, Charibeth Cheng

Just-in-time (JIT) defect prediction refers to the technique of predicting whether a code change is defective. Many contributions have been made in this area through the excellent dataset by Kamei. In this paper, we revisit the dataset and highlight preprocessing difficulties with the dataset and the limitations of the dataset on unsupervised learning. Secondly, we propose certain features in the Kamei dataset that can be used for training models. Lastly, we discuss the limitations of the dataset's features.

CLOct 22, 2020
Exploiting News Article Structure for Automatic Corpus Generation of Entailment Datasets

Jan Christian Blaise Cruz, Jose Kristian Resabal, James Lin et al.

Transformers represent the state-of-the-art in Natural Language Processing (NLP) in recent years, proving effective even in tasks done in low-resource languages. While pretrained transformers for these languages can be made, it is challenging to measure their true performance and capacity due to the lack of hard benchmark datasets, as well as the difficulty and cost of producing them. In this paper, we present three contributions: First, we propose a methodology for automatically producing Natural Language Inference (NLI) benchmark datasets for low-resource languages using published news articles. Through this, we create and release NewsPH-NLI, the first sentence entailment benchmark dataset in the low-resource Filipino language. Second, we produce new pretrained transformers based on the ELECTRA technique to further alleviate the resource scarcity in Filipino, benchmarking them on our dataset against other commonly-used transfer learning techniques. Lastly, we perform analyses on transfer learning techniques to shed light on their true performance when operating in low-data domains through the use of degradation tests.

CLMay 5, 2020
Establishing Baselines for Text Classification in Low-Resource Languages

Jan Christian Blaise Cruz, Charibeth Cheng

While transformer-based finetuning techniques have proven effective in tasks that involve low-resource, low-data environments, a lack of properly established baselines and benchmark datasets make it hard to compare different approaches that are aimed at tackling the low-resource setting. In this work, we provide three contributions. First, we introduce two previously unreleased datasets as benchmark datasets for text classification and low-resource multilabel text classification for the low-resource language Filipino. Second, we pretrain better BERT and DistilBERT models for use within the Filipino setting. Third, we introduce a simple degradation test that benchmarks a model's resistance to performance degradation as the number of training samples are reduced. We analyze our pretrained model's degradation speeds and look towards the use of this method for comparing models aimed at operating within the low-resource setting. We release all our models and datasets for the research community to use.

CLMay 3, 2020
Simplifying Paragraph-level Question Generation via Transformer Language Models

Luis Enrico Lopez, Diane Kathryn Cruz, Jan Christian Blaise Cruz et al.

Question generation (QG) is a natural language generation task where a model is trained to ask questions corresponding to some input text. Most recent approaches frame QG as a sequence-to-sequence problem and rely on additional features and mechanisms to increase performance; however, these often increase model complexity, and can rely on auxiliary data unavailable in practical use. A single Transformer-based unidirectional language model leveraging transfer learning can be used to produce high quality questions while disposing of additional task-specific complexity. Our QG model, finetuned from GPT-2 Small, outperforms several paragraph-level QG baselines on the SQuAD dataset by 0.95 METEOR points. Human evaluators rated questions as easy to answer, relevant to their context paragraph, and corresponding well to natural human speech. Also introduced is a new set of baseline scores on the RACE dataset, which has not previously been used for QG tasks. Further experimentation with varying model capacities and datasets with non-identification type questions is recommended in order to further verify the robustness of pretrained Transformer-based LMs as question generators.

CLOct 21, 2019
Localization of Fake News Detection via Multitask Transfer Learning

Jan Christian Blaise Cruz, Julianne Agatha Tan, Charibeth Cheng

The use of the internet as a fast medium of spreading fake news reinforces the need for computational tools that combat it. Techniques that train fake news classifiers exist, but they all assume an abundance of resources including large labeled datasets and expert-curated corpora, which low-resource languages may not have. In this work, we make two main contributions: First, we alleviate resource scarcity by constructing the first expertly-curated benchmark dataset for fake news detection in Filipino, which we call "Fake News Filipino." Second, we benchmark Transfer Learning (TL) techniques and show that they can be used to train robust fake news classifiers from little data, achieving 91% accuracy on our fake news dataset, reducing the error by 14% compared to established few-shot baselines. Furthermore, lifting ideas from multitask learning, we show that augmenting transformer-based transfer techniques with auxiliary language modeling losses improves their performance by adapting to writing style. Using this, we improve TL performance by 4-6%, achieving an accuracy of 96% on our best model. Lastly, we show that our method generalizes well to different types of news articles, including political news, entertainment news, and opinion articles.

CLJun 30, 2019
Evaluating Language Model Finetuning Techniques for Low-resource Languages

Jan Christian Blaise Cruz, Charibeth Cheng

Unlike mainstream languages (such as English and French), low-resource languages often suffer from a lack of expert-annotated corpora and benchmark resources that make it hard to apply state-of-the-art techniques directly. In this paper, we alleviate this scarcity problem for the low-resourced Filipino language in two ways. First, we introduce a new benchmark language modeling dataset in Filipino which we call WikiText-TL-39. Second, we show that language model finetuning techniques such as BERT and ULMFiT can be used to consistently train robust classifiers in low-resource settings, experiencing at most a 0.0782 increase in validation error when the number of training examples is decreased from 10K to 1K while finetuning using a privately-held sentiment dataset.