CL · Aug 16, 2022

BERTifying Sinhala -- A Comprehensive Analysis of Pre-trained Language Models for Sinhala Text Classification

Vinura Dhananjaya, Piyumal Demotte, Surangika Ranathunga, Sanath Jayasena

arXiv:2208.07864v23.428 citations · h-index: 14

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for effective Sinhala text classification tools by providing benchmarks and resources. It is incremental, however, in that it applies existing methods to a new language context.

This research evaluated pre-trained language models for Sinhala text classification. It found that XLM-R outperformed the other multilingual models tested (LaBSE and LASER), that two newly pre-trained monolingual Sinhala RoBERTa models outperformed the existing pre-trained models for Sinhala, and that the fine-tuned models set strong baselines and remain robust when labeled data is scarce.

This research provides the first comprehensive analysis of the performance of pre-trained language models for Sinhala text classification. We test on a set of different Sinhala text classification tasks and our analysis shows that out of the pre-trained multilingual models that include Sinhala (XLM-R, LaBSE, and LASER), XLM-R is the best model by far for Sinhala text classification. We also pre-train two RoBERTa-based monolingual Sinhala models, which are far superior to the existing pre-trained language models for Sinhala. We show that when fine-tuned, these pre-trained language models set a very strong baseline for Sinhala text classification and are robust in situations where labeled data is insufficient for fine-tuning. We further provide a set of recommendations for using pre-trained models for Sinhala text classification. We also introduce new annotated datasets useful for future research in Sinhala text classification and publicly release our pre-trained models.
