PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management
This system makes pre-trained model training more accessible by lowering hardware requirements, though the contribution is incremental because it builds on existing memory optimization techniques.
The paper tackles the high hardware requirements of training pre-trained models by proposing PatrickStar, a system that applies chunk-based memory management to CPU-GPU heterogeneous memory. It enables training of larger models, such as the 175B GPT3 on a 32 GPU cluster, extends the trainable model scale to 2.27-2.5 times that of DeepSpeed, and runs at higher speed than DeepSpeed.
The pre-trained model (PTM) is revolutionizing Artificial Intelligence (AI) technology. However, the hardware requirement of PTM training is prohibitively high, making it a game for only a small proportion of people. Therefore, we propose the PatrickStar system to lower the hardware requirements of PTMs and make them accessible to everyone. PatrickStar uses the CPU-GPU heterogeneous memory space to store the model data. Unlike existing works, we organize the model data in memory chunks and dynamically distribute them in the heterogeneous memory. Guided by the runtime memory statistics collected in a warm-up iteration, chunks are orchestrated efficiently in heterogeneous memory, which lowers the CPU-GPU data transmission volume and raises bandwidth utilization. Working in symbiosis with the Zero Redundancy Optimizer, PatrickStar scales to multiple GPUs on multiple nodes. The system can train bigger models with larger batch sizes, which existing works cannot accomplish. Experimental results show that PatrickStar extends model scales to 2.27 and 2.5 times those of DeepSpeed, and consistently exhibits significantly higher execution speed. PatrickStar also successfully runs the 175B GPT3 training task on a 32 GPU cluster. Our code is publicly available at https://github.com/Tencent/PatrickStar.
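The chunk idea described above can be illustrated with a minimal sketch. It is not PatrickStar's actual API: the names `Chunk`, `pack_into_chunks`, and `place_chunks` are hypothetical, sizes are counted in abstract elements, and the greedy placement policy is an assumption made for illustration. The only parts taken from the abstract are that model data is grouped into chunks and that chunk placement is guided by memory statistics from a warm-up iteration, represented here by an assumed `peak_non_model_data` figure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Chunk:
    """A fixed-size block of memory holding several tensors (hypothetical)."""
    capacity: int                       # elements the chunk can hold
    used: int = 0                       # elements currently occupied
    tensors: List[str] = field(default_factory=list)
    device: str = "cpu"                 # current placement: "cpu" or "gpu"


def pack_into_chunks(tensor_sizes: List[Tuple[str, int]],
                     chunk_capacity: int) -> List[Chunk]:
    """Pack tensors, in their given order, into fixed-size chunks."""
    chunks: List[Chunk] = []
    for name, size in tensor_sizes:
        if size > chunk_capacity:
            raise ValueError(f"tensor {name} does not fit in one chunk")
        # Open a new chunk when the current one cannot hold this tensor.
        if not chunks or chunks[-1].used + size > chunk_capacity:
            chunks.append(Chunk(capacity=chunk_capacity))
        chunks[-1].tensors.append(name)
        chunks[-1].used += size
    return chunks


def place_chunks(chunks: List[Chunk], gpu_capacity: int,
                 peak_non_model_data: int) -> List[Chunk]:
    """Assumed greedy policy: fill the GPU memory left over after the peak
    non-model footprint seen in warm-up; remaining chunks stay on the CPU."""
    remaining = gpu_capacity - peak_non_model_data
    for chunk in chunks:
        if chunk.capacity <= remaining:
            chunk.device = "gpu"
            remaining -= chunk.capacity
        else:
            chunk.device = "cpu"
    return chunks


# Toy example: four parameter tensors, chunks of 8 elements each.
tensors = [("w1", 4), ("w2", 3), ("w3", 5), ("w4", 2)]
chunks = pack_into_chunks(tensors, chunk_capacity=8)
chunks = place_chunks(chunks, gpu_capacity=20, peak_non_model_data=10)
for c in chunks:
    print(c.tensors, c.used, c.device)
```

Moving whole chunks rather than individual tensors means each CPU-GPU transfer is one large contiguous copy, which is the intuition behind the higher bandwidth utilization claimed in the abstract. A real system would also evict and prefetch chunks as training proceeds, which this sketch omits.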