DC · Apr 14

Beyond Pre-Training: The Full Lifecycle of Foundation Models on HPC Systems

arXiv:2604.12599
AI Analysis

For national supercomputing centers, this work provides a practical blueprint to extend HPC systems beyond pre-training to support the full AI lifecycle, addressing operational conflicts with batch processing.

The paper addresses the challenge of integrating fine-tuning and inference phases of foundation models into HPC batch-processing environments, presenting a hybrid cloud-native platform at CSCS that combines diskless GPU nodes with virtualized infrastructure orchestrated by Kubernetes. Initial investigations show improved user productivity and a blueprint for enabling supercomputers to support end-to-end AI workflows.

Large-scale pre-training of Foundation Models (FMs) is a computationally intensive first phase in enabling AI across diverse scientific and societal applications. This phase has positioned High-Performance Computing (HPC) facilities as indispensable backbones of "Sovereign AI" initiatives. While the massive throughput requirements of FM pre-training align with the traditional capability-oriented mission of HPC, the subsequent phases of the AI lifecycle, typically referred to as fine-tuning and inference, introduce operational paradigms that can conflict with established batch-processing environments. Moreover, these phases are not computationally trivial: they often require substantial high-end compute resources while exhibiting hardware utilization patterns that differ significantly from those of pre-training. This paper addresses the architectural and strategic challenges of operationalizing a complete AI lifecycle within a national supercomputing facility. We present a hybrid cloud-native platform being developed and deployed at the Swiss National Supercomputing Centre (CSCS) that combines diskless GPU-enabled HPE Cray EX compute nodes with virtualized commodity infrastructure. Orchestrated by Kubernetes, this service architecture bridges the gap between HPC batch processing and service-oriented workflows. We report our initial investigations into fine-tuning pipelines and highly available inference services, analyzing the associated trade-offs and their effect on user productivity. Our findings offer a blueprint for enabling supercomputers to integrate "AI Factory" services and workflows, carrying AI innovations into end-to-end scientific and industrial use cases.

