CL · Apr 8, 2024

Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining

arXiv:2404.05428 · SIGUL
Originality: Incremental advance
AI Analysis

This work addresses the need for affordable encoder models for scientific communities working with closely-related languages, offering an incremental improvement in model development efficiency.

The paper tackles the cost-efficient development of encoder models for closely related languages (Croatian, Serbian, Bosnian, Montenegrin). It compares training from scratch with additional pretraining of existing multilingual models and shows that additional pretraining achieves comparable performance with limited computation. It also shows that a neighboring language such as Slovenian can be included in the additional pretraining with little to no loss in performance.

The world of language models is going through turbulent times: better and ever larger models are coming out at an unprecedented speed. However, we argue that, especially for the scientific community, encoder models of up to 1 billion parameters are still very much needed, their primary use being to enrich large collections of data with the metadata necessary for downstream research. We investigate the best way to ensure the existence of such encoder models for a set of very closely related languages (Croatian, Serbian, Bosnian and Montenegrin) by setting up a diverse benchmark for these languages and comparing the trained-from-scratch models with new models constructed via additional pretraining of existing multilingual models. We show that performance comparable to that of dedicated from-scratch models can be obtained by additionally pretraining available multilingual models, even with a limited amount of computation. We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.
