Fawzi Roberto Mohamed

DC
h-index: 16
3 papers
19 citations
Novelty: 43%
AI Score: 43

3 Papers

52.4 · DC · Apr 15
An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience

Jonathan Coles, Stefano Schuppli, Lukas Drescher et al.

Large Language Models (LLMs) have surged as a transformative technology for science and society, prompting governments worldwide to pursue sovereign AI capabilities that ensure data compliance and cultural representation. However, the associated capital costs and engineering complexity required to train these models have largely restricted such capabilities to the private sector, leaving a significant gap for public institutions. This paper details the engineering journey behind training Apertus, a fully open multilingual foundation model, on the Alps supercomputer. Representing a first-of-its-kind achievement for academia at the 70B parameter scale, we successfully deployed a massive pre-training campaign on one of Europe's largest systems for open science, powered by NVIDIA GH200 Grace Hopper Superchips. We detail the challenges encountered in readying HPC infrastructure for training AI models, from overcoming storage bottlenecks to stabilizing large-scale interconnects, and the lessons learned in transforming a supercomputer into a resilient software-defined Machine Learning Platform. Finally, we discuss the post-training requirements and evolution of our Machine Learning Platform, outlining how this initial release lays the groundwork for a sustained, iterative operational capability, in particular for fine-tuning foundation models, that extends well beyond a single model training run.

3.5 · DC · Apr 18
Sarus Suite: Cloud-native Containers for HPC

Alberto Madonna, Matteo Chesi, Gwangmu Lee et al.

High-performance computing (HPC) systems must support fast-moving software stacks, especially in AI/ML, while preserving scheduler control, scalable startup, and production performance. Yet many HPC container solutions rely on specialized runtime stacks that weaken continuity with mainstream cloud-native workflows and require ongoing effort to sustain compatibility with the evolving upstream ecosystem. We argue that HPC should specialize the integration layer while keeping the container engine aligned with upstream container evolution. We present Sarus Suite, an upstream-aligned HPC container architecture built around an unchanged Podman engine. Sarus Suite adds the HPC-specific functionality needed for production use through complementary system layers for declarative runtime specification, scheduler-native execution, scalable shared-image access, and standards-based host capability injection. We evaluate Sarus Suite on a Cray EX GH200 system using communication-intensive HPC workloads, large-scale AI training, metadata-heavy startup workloads, and container startup measurements. Across PyFR, SPH-EXA, Megatron-LM, and Pynamic, Sarus Suite matches the performance and scaling of the production Enroot+Pyxis baseline while delivering consistently faster per-node container startup. The architecture also enables direct use of upstream OCI images, including NGC-based images, and supports cloud-native multi-container workflows expressed through Kubernetes manifests. These results show that HPC-grade containers do not require an HPC-specific runtime, provided that scheduler semantics, scalable image access, and host integration are implemented in explicit system layers. This preserves upstream continuity and software agility while maintaining scheduler control, scalability, and production performance.

CL · Sep 17, 2025
Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang et al. · eth-zurich

We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
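
The Goldfish objective mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the hash-based variant of the idea, in which a token is excluded from the training loss whenever a hash of the preceding tokens falls into a fixed bucket, so the model never receives a gradient for every token of a passage. The function names, the drop rate `k`, the context width `h`, and the choice of SHA-256 are all illustrative assumptions, not the settings or code used for Apertus.

```python
import hashlib


def goldfish_mask(token_ids, k=4, h=3):
    """Return one boolean per token; True means the token counts toward the loss.

    Hypothetical parameters: roughly 1 in k tokens is dropped, and the
    decision depends only on the h preceding token ids.
    """
    mask = []
    for i in range(len(token_ids)):
        if i < h:
            # Not enough preceding context to hash, so keep the token.
            mask.append(True)
            continue
        context = ",".join(str(t) for t in token_ids[i - h:i]).encode("utf-8")
        digest = hashlib.sha256(context).digest()
        bucket = int.from_bytes(digest[:8], "big") % k
        # Tokens whose context hashes to bucket 0 are excluded from the loss.
        mask.append(bucket != 0)
    return mask


def masked_mean_loss(per_token_losses, mask):
    """Average the per-token losses over the tokens that the mask keeps."""
    kept = [loss for loss, keep in zip(per_token_losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0
```

Because the decision depends only on the preceding `h` tokens, a passage that is duplicated across documents has the same tokens dropped each time it appears, which is what keeps the model from seeing the complete passage through repetition.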