DC · AI · DB · PF · Nov 26, 2025

Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM

arXiv:2511.21413v1 · 1 citation · h-index: 11 · Proceedings of the 1st International Workshop on Next-Gen Middleware for MLOps in Distributed Systems
Originality: Synthesis-oriented
AI Analysis

The paper addresses inefficient AI inference scaling on HPC systems for users in higher education. It is an incremental improvement that combines existing tools.

The paper tackles the challenge of adapting High-Performance Computing (HPC) infrastructure for dynamic AI inference workloads by integrating vLLM, Slurm, and Kubernetes on a supercomputer, achieving efficient scaling for up to 1000 concurrent requests with only about 500 ms of latency overhead.

Due to rising demand for Artificial Intelligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging. High-Performance Computing (HPC) has become a prevalent platform for implementing such solutions. However, the classical operating model of HPC does not adapt well to the requirements of synchronous, user-facing, dynamic AI application workloads. In this paper, we propose a solution that serves LLMs by integrating vLLM, Slurm and Kubernetes on the supercomputer RAMSES. The initial benchmark indicates that the proposed architecture scales efficiently for 100, 500 and 1000 concurrent requests, incurring an overhead of only approximately 500 ms in end-to-end latency.
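To make the benchmark setup concrete, the following is a minimal sketch of how end-to-end latency could be measured at the three concurrency levels named in the abstract. It is not the authors' benchmark code. The names `fake_inference_request`, `run_level`, `benchmark` and `overhead_s` are hypothetical, the request is simulated with a short sleep instead of a real call to the serving stack, and the definition of overhead as "latency through the full stack minus latency of a direct baseline call" is an assumption, since the abstract does not state the baseline.

```python
import asyncio
import statistics
import time


async def fake_inference_request(service_time_s: float = 0.01) -> None:
    """Hypothetical stand-in for one inference call.

    A real benchmark would send a request to the deployed service here.
    """
    await asyncio.sleep(service_time_s)


async def timed_call(request_fn) -> float:
    """Return the wall-clock latency of a single request in seconds."""
    start = time.perf_counter()
    await request_fn()
    return time.perf_counter() - start


async def run_level(concurrency: int, request_fn) -> dict:
    """Issue `concurrency` requests at once and summarise their latencies."""
    latencies = await asyncio.gather(
        *(timed_call(request_fn) for _ in range(concurrency))
    )
    return {
        "concurrency": concurrency,
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
    }


def benchmark(levels=(100, 500, 1000), request_fn=fake_inference_request) -> list:
    """Run each concurrency level in turn and collect one summary per level."""
    return [asyncio.run(run_level(n, request_fn)) for n in levels]


def overhead_s(measured_s: float, baseline_s: float) -> float:
    """Assumed definition: full-stack latency minus direct-call latency."""
    return measured_s - baseline_s


if __name__ == "__main__":
    for row in benchmark():
        print(row)
```

Under this assumed definition, a mean latency of 1.5 s through the full stack against a 1.0 s direct baseline would correspond to an overhead of 0.5 s, the order of magnitude the abstract reports.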

