CL AI LG Sep 30, 2024

Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

Mehdi Ali, Michael Fromm, Klaudia Thellmann, Jan Ebert, Alexander Arno Weber, Richard Rutmann, Charvi Jain, Max Lübbering, Daniel Steinigen, Johannes Leveling, Katrin Klug, Jasper Schulze Buschhoff

arXiv:2410.03730v311.218 citations h-index: 28

Originality: Incremental advance

AI Analysis

This work addresses the need for more inclusive AI tools in Europe by providing LLMs that better handle linguistic diversity. The advance is incremental, since it builds on existing model architectures.

The authors tackle the limited multilingual support of large language models by developing Teuken-7B-Base and Teuken-7B-Instruct, which support all 24 official EU languages and show strong performance on European versions of the ARC, HellaSwag, and TruthfulQA benchmarks.

We present two multilingual LLMs, Teuken-7B-Base and Teuken-7B-Instruct, designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and using a custom multilingual tokenizer, our models address the limitations of existing LLMs, which predominantly focus on English or a few high-resource languages. We detail the models' development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate strong performance on multilingual benchmarks, as evidenced by their results on European versions of ARC, HellaSwag, and TruthfulQA.
