LG, DC · Jul 2, 2024

QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices

Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Yibo Zhu, Chuan Wu

arXiv:2407.02327v16.43 citations · h-index: 18 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of using idle inference hardware for training in production clusters. It is an incremental improvement to hybrid device training systems.

The paper tackles synchronous distributed deep learning training across heterogeneous GPUs by proposing QSync, a system that selects quantization-minimized settings to balance efficiency and accuracy. It reports a 0.27-1.03% accuracy improvement over uniform precision on from-scratch training tasks, and its predictor simulates distributed mixed-precision training with less than 5% error.

A number of production deep learning clusters have explored using inference hardware for DNN training during off-peak serving hours, when many inference GPUs sit idle. Conducting DNN training with a combination of heterogeneous training and inference GPUs, known as hybrid device training, presents considerable challenges due to disparities in compute capability and significant differences in memory capacity. We propose QSync, a training system that enables efficient synchronous data-parallel DNN training over hybrid devices by strategically exploiting quantized operators. According to each device's available resource capacity, QSync selects a quantization-minimized setting for operators in the distributed DNN training graph, minimizing model accuracy degradation while keeping the training efficiency brought by quantization. We carefully design a predictor with a bi-directional mixed-precision indicator to reflect the sensitivity of DNN layers to fixed-point and floating-point low-precision operators, a replayer with a neighborhood-aware cost mapper to accurately estimate the latency of distributed hybrid mixed-precision training, and an allocator that efficiently synchronizes workers with minimized model accuracy degradation. QSync bridges the computational graph on PyTorch to an optimized backend for quantization kernel performance and flexible support for various GPU architectures. Extensive experiments show that QSync's predictor can accurately simulate distributed mixed-precision training with <5% error, with a consistent 0.27-1.03% accuracy improvement on from-scratch training tasks compared to uniform precision.
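To make the idea of a "quantization-minimized setting" concrete, here is a minimal Python sketch. It is not QSync's actual allocator: the greedy strategy, the `Op` class, the `allocate` function, and all latency and sensitivity numbers are hypothetical stand-ins for what the paper's replayer and predictor would supply. The sketch starts every operator on the slower device at the lowest precision, then raises precision wherever that removes the most estimated degradation per extra millisecond, as long as the device still finishes within a latency budget.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """Hypothetical operator record; not a QSync data structure."""
    name: str
    latency_ms: dict   # precision -> assumed latency (stand-in for a replayer estimate)
    sensitivity: dict  # precision -> assumed degradation score (stand-in for a predictor)


def allocate(ops, budget_ms, precisions=("int8", "fp16", "fp32")):
    """Greedy sketch: begin at the lowest precision, then recover precision
    where it buys the largest drop in sensitivity per extra millisecond,
    never exceeding the latency budget."""
    level = {op.name: 0 for op in ops}  # index into `precisions`
    total = sum(op.latency_ms[precisions[0]] for op in ops)
    if total > budget_ms:
        raise ValueError("budget is infeasible even at the lowest precision")

    while True:
        best = None  # (gain per ms, op, extra latency)
        for op in ops:
            i = level[op.name]
            if i + 1 >= len(precisions):
                continue  # already at the highest precision
            cur, nxt = precisions[i], precisions[i + 1]
            extra = op.latency_ms[nxt] - op.latency_ms[cur]
            gain = op.sensitivity[cur] - op.sensitivity[nxt]
            if gain <= 0 or total + extra > budget_ms:
                continue  # no benefit, or the upgrade would break the budget
            ratio = gain / max(extra, 1e-9)
            if best is None or ratio > best[0]:
                best = (ratio, op, extra)
        if best is None:
            break  # no affordable upgrade remains
        _, op, extra = best
        level[op.name] += 1
        total += extra

    return {name: precisions[i] for name, i in level.items()}, total


# Illustrative numbers only; they are not measurements from the paper.
ops = [
    Op("conv1", {"int8": 1.0, "fp16": 2.0, "fp32": 4.0},
       {"int8": 0.9, "fp16": 0.1, "fp32": 0.0}),
    Op("fc", {"int8": 1.0, "fp16": 1.5, "fp32": 3.0},
       {"int8": 0.2, "fp16": 0.05, "fp32": 0.0}),
]
plan, total = allocate(ops, budget_ms=4.0)
print(plan, total)
```

In a synchronous data-parallel setting, `budget_ms` would plausibly be set to the iteration time of the faster training GPU, so that the inference GPU quantizes only as much as it needs to keep pace.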
