CL · AI · LG · Sep 17, 2025

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

ETH Zurich
arXiv:2509.14233v1 · 19 citations · h-index: 16
Originality: Incremental advance
AI Analysis

The paper addresses legal and ethical compliance in open LLMs for global users, though its improvement over existing open models is incremental.

The paper tackles the lack of data compliance and multilingual representation in open large language models. It introduces Apertus, a suite of models pretrained exclusively on openly available data with expanded multilingual coverage, which approaches state-of-the-art results among fully open models on multilingual benchmarks.

We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
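To illustrate the idea behind the Goldfish objective mentioned in the abstract, here is a minimal Python sketch; it is not the Apertus training code. It assumes the commonly described formulation in which a pseudo-random subset of token positions is excluded from the next-token loss, with the drop decision derived from a hash of the preceding tokens so that a repeated passage is always masked the same way. The function names `goldfish_mask` and `masked_nll`, the drop rate `k`, and the `context_width` value are hypothetical choices for this example; the abstract does not state the settings actually used.

```python
import hashlib


def goldfish_mask(token_ids, k=4, context_width=3):
    """Return one boolean per token: True means the token contributes to the loss.

    Assumption for this sketch: a token is dropped when a hash of the
    `context_width` preceding token ids lands in bucket 0 out of `k` buckets,
    so roughly 1/k of eligible tokens are excluded from the loss.
    """
    mask = []
    for i in range(len(token_ids)):
        if i < context_width:
            # Not enough preceding tokens to hash; keep the token.
            mask.append(True)
            continue
        context = token_ids[i - context_width:i]
        digest = hashlib.sha256(",".join(map(str, context)).encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % k
        mask.append(bucket != 0)
    return mask


def masked_nll(token_log_probs, mask):
    """Average negative log-likelihood over the kept tokens only."""
    kept = [-lp for lp, keep in zip(token_log_probs, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0
```

Because the decision depends only on the local context, the same passage appearing in two documents gets the same positions dropped, so the model never receives a training signal for those tokens no matter how often the passage repeats. For example, `masked_nll([-1.0, -2.0, -3.0], [True, False, True])` averages only the first and third tokens and returns `2.0`.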
