DC · AI · LG · Sep 2, 2025

MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall

Avinash Maurya, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

arXiv:2509.02480v1 · 4.34 citations · h-index: 7 · SC

Originality: Incremental advance

AI Analysis

This work addresses the GPU memory wall in LLM training with a domain-specific optimization that is an incremental advance over existing offloading techniques.

The paper tackles the problem of I/O bottlenecks in offloading for large language model (LLM) pre-training on resource-constrained GPUs, proposing MLP-Offload, which achieves 2.5× faster iterations compared to state-of-the-art methods.

Training LLMs larger than the aggregated memory of multiple GPUs is increasingly necessary because LLM sizes are growing faster than GPU memory. State-of-the-art approaches address this by offloading to multi-tier host memory or disk. Despite advanced asynchronous multi-tier read/write strategies, such offloading incurs significant I/O overheads on the critical path of training, which slows iterations. To this end, we propose MLP-Offload, a novel multi-level, multi-path offloading engine designed to optimize LLM training on resource-constrained setups by mitigating I/O bottlenecks. Several key observations drive the design of MLP-Offload: I/O overheads during the update phase dominate the iteration time; the I/O bandwidth of the third-level remote storage tier remains unutilized; and contention due to concurrent offloading amplifies I/O bottlenecks. Building on these insights, we design and implement MLP-Offload to offload the optimizer states across multiple tiers in a cache-efficient and concurrency-controlled fashion, mitigating I/O bottlenecks during the backward and update phases. Evaluations on models of up to 280B parameters show that MLP-Offload achieves 2.5× faster iterations compared to state-of-the-art LLM training runtimes.
