CL Nov 11, 2021

Improving Large-scale Language Models and Resources for Filipino

Jan Christian Blaise Cruz, Charibeth Cheng

arXiv:2111.06053v1 29.6584 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the low-resource challenge for Filipino language processing, but its contribution is incremental because it builds on existing methods such as RoBERTa.

The paper tackles the shortage of language resources for Filipino by constructing a large-scale pretraining corpus (TLUnified) and training new RoBERTa models, which yields an average gain of 4.47% test accuracy across three benchmark classification tasks.

In this paper, we improve on existing language resources for the low-resource Filipino language in two ways. First, we outline the construction of the TLUnified dataset, a large-scale pretraining corpus that serves as an improvement over smaller existing pretraining datasets for the language in terms of scale and topic variety. Second, we pretrain new Transformer language models following the RoBERTa pretraining technique to supplant existing models trained with small corpora. Our new RoBERTa models show significant improvements over existing Filipino models in three benchmark datasets with an average gain of 4.47% test accuracy across the three classification tasks of varying difficulty.
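The "average gain" figure in the abstract can be read as the mean of per-task accuracy differences between a new model and a baseline. The sketch below illustrates that arithmetic only. It assumes the gain is an absolute difference in percentage points, which the abstract does not state explicitly. The function name `average_gain`, the task names, and all accuracy values are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def average_gain(baseline, new):
    """Mean per-task difference (new - baseline) in accuracy, in percentage points."""
    if baseline.keys() != new.keys():
        raise ValueError("both models must be scored on the same tasks")
    return mean(new[task] - baseline[task] for task in baseline)


# Hypothetical placeholder accuracies (%), NOT figures reported in the paper.
baseline_acc = {"task_a": 70.0, "task_b": 80.0, "task_c": 90.0}
new_acc = {"task_a": 75.0, "task_b": 84.0, "task_c": 93.0}

gain = average_gain(baseline_acc, new_acc)
print(f"average gain: {gain:.2f} percentage points")
```

Averaging per-task differences weights each task equally regardless of its test-set size, which is one common reading of an "average gain across tasks".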
