A Workflow-Oriented Framework for Asynchronous Human-AI Collaboration in Hybrid and Compute-Intensive HPC Environments
For practitioners deploying AI in high-stakes HPC environments, this framework addresses the bottleneck of real-time human interaction by enabling asynchronous oversight.
The paper presents a workflow framework for asynchronous human-AI collaboration in HPC environments, enabling non-blocking supervision and preventing idle resources. It demonstrates benefits in portability, efficiency, and oversight for model training on systems like MareNostrum 5.
Human involvement is critical in training and deploying AI systems in high-stakes defence and security contexts. However, real-time interaction is impractical in HPC environments due to compute intensity and resource constraints. We present a workflow framework that enables asynchronous human-AI collaboration across hybrid infrastructures, including HPC clusters, local machines, and cloud platforms. Workflows can pause at defined checkpoints for human input without halting the underlying compute jobs, which prevents idle resources and enables non-blocking supervision. The framework interacts with SLURM-based scheduling, supports both containerized and native tasks, and is customized for scenarios requiring human judgment and adaptability. We demonstrate its application in model training on systems like MareNostrum 5, highlighting benefits in portability, efficiency, and oversight in operational AI workflows.
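
To make the idea of a non-blocking checkpoint concrete, the following is a minimal single-process sketch in Python. It is not the framework's API: the names `ReviewCheckpoint`, `train`, and `demo` are hypothetical, the training step is a placeholder, and the reviewer's answer is simulated in the same process, whereas a real deployment would receive it from an external channel. The sketch only illustrates the pattern of requesting human input at a checkpoint and continuing to compute until an answer arrives.

```python
from typing import Iterator, Optional


class ReviewCheckpoint:
    """Hypothetical checkpoint that holds a question until a human answers it."""

    def __init__(self, question: str) -> None:
        self.question = question
        self.requested = False
        self._decision: Optional[str] = None

    def request(self) -> None:
        # Assumption: a real system would notify the reviewer here.
        self.requested = True

    def submit(self, decision: str) -> None:
        # Called from the reviewer's side at any later time.
        self._decision = decision

    def poll(self) -> Optional[str]:
        # Non-blocking: returns the decision if one exists, otherwise None.
        return self._decision


def train(checkpoint: ReviewCheckpoint,
          max_epochs: int = 10,
          review_epoch: int = 2) -> Iterator[int]:
    """Yield each completed epoch; never wait for the reviewer."""
    for epoch in range(max_epochs):
        # Placeholder for one real training step.
        if epoch == review_epoch:
            checkpoint.request()
        yield epoch
        # Keep computing unless the reviewer has explicitly said to stop.
        if checkpoint.poll() == "stop":
            return


def demo() -> list:
    checkpoint = ReviewCheckpoint("Loss has plateaued; continue training?")
    completed = []
    for epoch in train(checkpoint):
        completed.append(epoch)
        if epoch == 4:
            # Simulated reviewer answering two epochs after being asked.
            checkpoint.submit("stop")
    return completed


if __name__ == "__main__":
    print(demo())  # [0, 1, 2, 3, 4]
```

In this sketch the review is requested at epoch 2, yet epochs 3 and 4 still run while the answer is pending, so the compute allocation is not left idle. Whether work should continue or be parked while a decision is outstanding is a design choice that depends on the scenario.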