DC LG PF Aug 26, 2025

CARMA: Collocation-Aware Resource Manager

Ehsan Yousefzadeh-Asl-Miandoab, Reza Karimzadeh, Bulat Ibragimov, Florina M. Ciorba, Pınar Tözün

arXiv:2508.19073v2 1.2 h-index: 13

Originality: Incremental advance

AI Analysis

The paper addresses inefficient GPU resource management for server-scale deep learning workloads, offering incremental improvements in collocation strategies.

The paper tackles GPU underutilization in deep learning by proposing CARMA, a collocation-aware resource management system. For the best-performing collocation policy, CARMA increases SM utilization by 54%, parallelism per SM by 61%, and memory use by 62%. On the evaluated workload, this yields a ~35% reduction in makespan and a ~15% reduction in GPU energy consumption.

GPUs running deep learning (DL) workloads are frequently underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization but introduces two key risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can negate any throughput gains. These issues reduce system robustness, quality of service, and energy efficiency. We present CARMA, a task-level, collocation-aware resource management system at server scale. CARMA addresses collocation challenges via (1) fine-grained monitoring and bookkeeping of GPUs and a collocation risk analysis that filters out high-risk GPUs; (2) task placement policies that cap GPU utilization to avoid OOMs and limit interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method that relaunches jobs that crashed due to OOMs. Our evaluation on a DL training workload derived from real-world traces shows that CARMA uses GPUs more efficiently by making more informed collocation decisions: for the best-performing collocation policy, CARMA increases GPU streaming multiprocessor (SM) utilization by 54%, the parallelism achieved per SM by 61%, and memory use by 62%. This results in reductions of ~35% and ~15% in the end-to-end execution time (makespan) and GPU energy consumption, respectively, for this workload.
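The four mechanisms listed in the abstract (risk filtering, capped placement, memory estimation, OOM recovery) can be illustrated with a minimal sketch. This is not CARMA's implementation: the abstract gives no thresholds, data structures, or policy details, so the names `GpuState`, `Task`, `place`, and `relaunch_after_oom`, the 80% caps, the "most headroom" tie-break, and the 1.5x estimate bump on relaunch are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuState:
    """Snapshot of one GPU as a monitor might report it (hypothetical fields)."""
    gpu_id: int
    mem_total_mib: int
    mem_used_mib: int
    sm_util_pct: float


@dataclass
class Task:
    name: str
    est_mem_mib: int  # assumed to come from some GPU memory need estimator


# Hypothetical caps: the abstract does not state the values CARMA uses.
MEM_CAP_FRACTION = 0.80
SM_UTIL_CAP_PCT = 80.0


def place(task: Task, gpus: List[GpuState]) -> Optional[int]:
    """Return the id of a GPU that can host the task under both caps, or None."""
    candidates = []
    for g in gpus:
        # Risk filter 1: a busy GPU is treated as a high interference risk.
        if g.sm_util_pct >= SM_UTIL_CAP_PCT:
            continue
        # Risk filter 2: projected memory above the cap is treated as OOM risk.
        projected = g.mem_used_mib + task.est_mem_mib
        if projected > MEM_CAP_FRACTION * g.mem_total_mib:
            continue
        candidates.append((projected / g.mem_total_mib, g.gpu_id))
    if not candidates:
        return None  # no safe GPU: the caller would keep the task queued
    # Assumed tie-break: pick the GPU with the lowest projected memory fraction.
    return min(candidates)[1]


def relaunch_after_oom(task: Task, gpus: List[GpuState],
                       failed_gpu_id: int, bump: float = 1.5) -> Optional[int]:
    """Recovery sketch (assumption): raise the estimate, retry on another GPU."""
    retry = Task(task.name, int(task.est_mem_mib * bump))
    others = [g for g in gpus if g.gpu_id != failed_gpu_id]
    return place(retry, others)


if __name__ == "__main__":
    gpus = [
        GpuState(0, 40000, 30000, 50.0),  # memory nearly full
        GpuState(1, 40000, 10000, 90.0),  # SMs already saturated
        GpuState(2, 40000, 8000, 40.0),   # plenty of headroom
    ]
    print(place(Task("resnet", 6000), gpus))
```

In this sketch, GPU 0 is rejected by the memory cap and GPU 1 by the SM utilization cap, so the task lands on GPU 2. A real system would refresh `GpuState` from live monitoring before every decision instead of using static snapshots.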
