CL

Sep 24, 2024

EuroLLM: Multilingual Language Models for Europe

Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow

Meta AI

arXiv:2409.16235v1

24.6102 citations

h-index: 45

Originality: Synthesis-oriented

AI Analysis

The paper addresses the lack of high-quality multilingual LLMs for European languages, though the contribution is incremental because it builds on existing LLM methods.

The EuroLLM project developed open-weight multilingual language models for European languages, achieving competitive performance on multilingual benchmarks and machine translation tasks.

The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, and report their performance on multilingual general benchmarks and machine translation.
