CL AI LG · Sep 17, 2025

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada

ETH Zurich

arXiv:2509.14233v1 · 26.319 citations · h-index: 16

Originality: Incremental advance

AI Analysis

This work addresses legal and ethical compliance in open LLMs for a global user base, though its contribution is an incremental improvement over existing open models.

The paper tackles the lack of data compliance and multilingual representation in open large language models by introducing Apertus, a suite of models trained on openly available data with enhanced multilingual coverage, approaching state-of-the-art results among fully open models on multilingual benchmarks.

We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
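The abstract names the Goldfish objective but does not describe it. As a rough illustration only, the sketch below assumes the general idea behind that kind of objective: a pseudo-random subset of token positions, chosen deterministically from the preceding tokens, is left out of the next-token loss, so a passage that recurs in the corpus always has the same positions withheld. The function names, the drop rate `k`, the context width `context`, and the use of SHA-256 are all hypothetical choices made for this example; they are not taken from the Apertus training code.

```python
import hashlib


def goldfish_mask(token_ids, k=4, context=3):
    """Return 1 for positions kept in the loss and 0 for positions dropped.

    Hypothetical sketch: a position is dropped when a hash of the
    `context` preceding token ids is divisible by `k`, so roughly one
    position in `k` is excluded, and the decision depends only on the
    local context rather than on the position in the sequence.
    """
    mask = []
    for i in range(len(token_ids)):
        if i < context:
            # Not enough preceding tokens to hash; keep the position.
            mask.append(1)
            continue
        window = token_ids[i - context:i]
        digest = hashlib.sha256(",".join(map(str, window)).encode()).digest()
        drop = int.from_bytes(digest[:8], "big") % k == 0
        mask.append(0 if drop else 1)
    return mask


def masked_nll(token_log_probs, mask):
    """Average negative log-likelihood over the kept positions only."""
    kept = [lp for lp, m in zip(token_log_probs, mask) if m]
    if not kept:
        return 0.0
    return -sum(kept) / len(kept)


if __name__ == "__main__":
    tokens = [1, 2, 3, 4, 5, 6, 7] * 2  # the same passage seen twice
    mask = goldfish_mask(tokens)
    # Positions with identical preceding context get identical decisions.
    print(mask[3:7] == mask[10:14])
```

Because the mask is a function of the local context, repeated copies of a passage never reveal the withheld tokens to the loss, which is the intuition for why such an objective can suppress verbatim recall while leaving most of the training signal intact.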
