PF | DC | LG | Oct 23, 2025

xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads

arXiv:2510.21048v1 | 1 citation | h-index: 1 | Middleware
Originality: Highly original
AI Analysis

This work addresses GPU scarcity and scheduling inefficiencies for deep learning practitioners in shared cluster environments, offering a non-intrusive solution with significant performance gains.

The paper tackles the problem of accurately estimating GPU memory requirements for deep learning training workloads to prevent out-of-memory errors and improve resource utilization in shared clusters. It introduces xMem, a CPU-based framework that reduces median relative error by 91% and increases memory conservation potential by 368% compared to existing methods.

The global scarcity of GPUs necessitates more sophisticated strategies for Deep Learning jobs in shared cluster environments. Accurate estimation of how much GPU memory a job will require is fundamental to enabling advanced scheduling and GPU sharing, which helps prevent out-of-memory (OOM) errors and resource underutilization. However, existing estimation methods have limitations. Approaches relying on static analysis or on historical data with machine learning often fail to capture runtime dynamics accurately. Furthermore, direct GPU analysis consumes scarce resources, and some techniques require intrusive code modifications. Thus, the key challenge lies in precisely estimating dynamic memory requirements, including memory allocator nuances, without consuming GPU resources or requiring intrusive code changes. To address this challenge, we propose xMem, a novel framework that leverages CPU-only dynamic analysis to accurately estimate peak GPU memory requirements a priori. We conducted a thorough evaluation of xMem against state-of-the-art solutions using workloads from 25 different models, including architectures such as Convolutional Neural Networks and Transformers. The analysis of 5209 runs, which includes ANOVA and Monte Carlo results, highlights xMem's benefits: it decreases the median relative error by 91% and reduces by 75% the probability that an estimate fails when used as a safe OOM threshold, meaning that the estimated value can often be used directly without causing OOM. Ultimately, these improvements lead to a 368% increase in memory conservation potential over current solutions.
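The two headline metrics in the abstract can be made concrete with a minimal sketch. It is not the paper's evaluation code. It assumes that relative error is |estimate - actual| / actual and that an estimate "fails as a safe OOM threshold" when it is below the measured peak. The function names and the sample values in GiB are hypothetical and chosen only for illustration.

```python
from statistics import median


def relative_errors(estimates, actual_peaks):
    """Per-run relative error: |estimate - actual| / actual (assumed definition)."""
    return [abs(e - a) / a for e, a in zip(estimates, actual_peaks)]


def median_relative_error(estimates, actual_peaks):
    """Median of the per-run relative errors."""
    return median(relative_errors(estimates, actual_peaks))


def oom_failure_rate(estimates, actual_peaks):
    """Fraction of runs whose estimate is below the measured peak.

    Assumption: a job capped at an underestimate would hit OOM, so an
    underestimate counts as a failed safe threshold.
    """
    failures = sum(1 for e, a in zip(estimates, actual_peaks) if e < a)
    return failures / len(actual_peaks)


# Hypothetical peak GPU memory per run, in GiB (illustrative, not from the paper).
actual = [10.0, 20.0, 8.0, 16.0]
estimated = [11.0, 19.0, 8.0, 20.0]

mre = median_relative_error(estimated, actual)
fail = oom_failure_rate(estimated, actual)
print(f"median relative error: {mre:.3f}")
print(f"OOM failure rate:      {fail:.2f}")
```

Under these assumptions a lower median relative error means tighter estimates, while a lower failure rate means the estimate can more often be used directly as a memory cap.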

