CL, Nov 11, 2021

Improving Large-scale Language Models and Resources for Filipino

arXiv:2111.06053v1584 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the low-resource challenge for Filipino language processing. The contribution is incremental, since it builds on existing methods such as RoBERTa.

The paper tackles the problem of limited language resources for Filipino by constructing a large-scale pretraining corpus (TLUnified) and training new RoBERTa models. The new models achieve an average gain of 4.47% test accuracy over existing Filipino models across three benchmark classification tasks.

In this paper, we improve on existing language resources for the low-resource Filipino language in two ways. First, we outline the construction of the TLUnified dataset, a large-scale pretraining corpus that serves as an improvement over smaller existing pretraining datasets for the language in terms of scale and topic variety. Second, we pretrain new Transformer language models following the RoBERTa pretraining technique to supplant existing models trained with small corpora. Our new RoBERTa models show significant improvements over existing Filipino models in three benchmark datasets with an average gain of 4.47% test accuracy across the three classification tasks of varying difficulty.
