CL · Jul 18, 2024

FuLG: 150B Romanian Corpus for Language Model Pretraining

Vlad-Andrei Bădoiu, Mihai-Valentin Dumitru, Alexandru M. Gherghescu, Alexandru Agache, Costin Raiciu

arXiv:2407.13657v1 · 2 citations · h-index: 31

Originality: Synthesis-oriented

AI Analysis

This work addresses data scarcity for Romanian language model development, but its contribution is incremental: it applies an existing method to new data.

The authors tackle the lack of large-scale pretraining data for underrepresented languages by introducing FuLG, a 150-billion-token Romanian corpus extracted from CommonCrawl, and compare it against existing Romanian corpora through ablation studies.

Research in the field of language models is rapidly evolving, with many open models being released to the public. Openly available pretraining corpora usually focus on only a handful of languages, with many others either missing completely or extremely underrepresented. In this report, we introduce FuLG, a hundred-fifty-billion-token Romanian corpus extracted from CommonCrawl. We present our methodology for filtering FuLG and compare it via ablation studies against existing Romanian corpora.
