ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates
ZenFlow addresses inefficiencies in fine-tuning LLMs for researchers and practitioners by reducing GPU idle time and PCIe bottlenecks, though it is an incremental improvement over existing offloading methods.
The paper tackles the problem of GPU stalls in offloaded training of large language models by introducing ZenFlow, a framework that prioritizes important parameters and decouples updates, achieving up to 5x speedup, 2x lower PCIe traffic, and over 85% reduction in GPU stalls while preserving accuracy.
Fine-tuning large language models (LLMs) often exceeds GPU memory limits, prompting systems to offload model states to CPU memory. However, existing offloaded training frameworks like ZeRO-Offload treat all parameters equally and update the full model on the CPU, causing severe GPU stalls, where fast, expensive GPUs sit idle waiting for slow CPU updates and limited-bandwidth PCIe transfers. We present ZenFlow, a new offloading framework that prioritizes important parameters and decouples updates between GPU and CPU. ZenFlow performs in-place updates of important gradients on GPU, while asynchronously offloading and accumulating less important ones on CPU, fully overlapping CPU work with GPU computation. To scale across GPUs, ZenFlow introduces a lightweight gradient selection method that exploits a novel spatial and temporal locality property of important gradients, avoiding costly global synchronization. ZenFlow achieves up to 5x end-to-end speedup, 2x lower PCIe traffic, and reduces GPU stalls by over 85 percent, all while preserving accuracy.
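The split between immediate and deferred updates described in the abstract can be illustrated with a small single-threaded simulation. This is a minimal sketch under stated assumptions, not ZenFlow's implementation. It assumes gradient magnitude as the importance measure, plain SGD as the optimizer, and a fixed flush interval for the deferred updates. The names `select_important`, `train_step`, `cpu_accum`, and `flush_every` are hypothetical. Real asynchrony, PCIe transfers, and the multi-GPU selection method are not modeled.

```python
from typing import List


def select_important(grads: List[float], k: int) -> List[int]:
    """Return the indices of the k largest-magnitude gradients.

    Assumption: magnitude is used as the importance proxy for illustration.
    """
    order = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    return sorted(order[:k])


def train_step(
    params: List[float],
    grads: List[float],
    cpu_accum: List[float],
    step: int,
    k: int,
    lr: float,
    flush_every: int,
) -> None:
    """Simulate one step: update important parameters now, defer the rest.

    Important gradients are applied immediately (standing in for the GPU path).
    The remaining gradients are summed into cpu_accum (standing in for the CPU
    path) and applied once every flush_every steps.
    """
    important = set(select_important(grads, k))
    for i, g in enumerate(grads):
        if i in important:
            params[i] -= lr * g      # immediate update for important entries
        else:
            cpu_accum[i] += g        # deferred accumulation for the rest

    # Periodically apply the accumulated gradients and reset the buffer.
    if (step + 1) % flush_every == 0:
        for i in range(len(params)):
            params[i] -= lr * cpu_accum[i]
            cpu_accum[i] = 0.0
```

In this sketch, a call such as `train_step(params, grads, accum, step, k=1, lr=0.1, flush_every=2)` changes only the single largest-gradient parameter on every step, while the others move only on every second step. In a real system the deferred path would run concurrently with the next forward and backward pass, which is where the stall reduction would come from.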