CLApr 2, 2024

Poro 34B and the Blessing of Multilinguality

arXiv:2404.01856v324 citationsh-index: 50Has CodeNoDaLiDa/Baltic-HLT
Originality Incremental advance
AI Analysis

This addresses the data scarcity issue for low-resource languages like Finnish, offering a practical approach to improve model performance beyond monolingual training, though it is incremental in applying multilingual methods to specific domains.

The study tackled the problem of limited training data for most languages by proposing multilingual pretraining as a solution, introducing Poro 34B, a 34B parameter model trained on Finnish, English, and programming languages, which substantially advances capabilities for Finnish and achieves competitive performance in English and programming tasks.

The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing: when the lack of training data is a constraint for effectively training larger models for a target language, augmenting the dataset with other languages can offer a way to improve over the capabilities of monolingual models for that language. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish and excels in translation, while also achieving competitive performance in its class for English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes