EuroLLM-22B: Technical Report
The work addresses the underrepresentation of European languages in open large language models, though its contribution is incremental, since it applies existing methods to new data.
The paper tackles the underrepresentation of European languages in open large language models by developing EuroLLM-22B, a model trained from scratch to support all 24 official EU languages and 11 additional languages, achieving competitive performance on multilingual benchmarks.
This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data, the updated EuroBlocks instruction datasets, and our pretraining and evaluation codebases.