PFLG Feb 11, 2025

Memory Analysis on the Training Course of DeepSeek Models

arXiv:2502.07846v1 | 2 citations | h-index: 3
Originality: Synthesis-oriented
AI Analysis

The paper offers useful insights for researchers and engineers optimizing memory in distributed training of large models, but its contribution is incremental: it is a theoretical analysis and proposes no new methods.

The paper analyzes GPU memory consumption during training of DeepSeek models, examining factors like micro-batch size and parallelism to clarify device-level memory requirements in large-scale mixture-of-experts models.

We present a theoretical analysis of GPU memory consumption during the training of DeepSeek models such as DeepSeek-v2 and DeepSeek-v3. Our primary objective is to clarify the device-level memory requirements associated with various distributed training configurations. Specifically, we examine critical factors influencing memory usage, including micro-batch size, activation recomputation policies, 3D parallelism, and ZeRO optimizations. It is important to emphasize that the training policies discussed in this report are not representative of DeepSeek's official configurations. Instead, they are explored to provide a deeper understanding of memory dynamics in the training of large-scale mixture-of-experts models.
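To make the role of ZeRO in device-level memory concrete, here is a minimal sketch of the static (parameter, gradient, and optimizer-state) memory per device. It is not taken from the report. It assumes the standard mixed-precision Adam accounting from the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states. The function name and the example sizes are hypothetical, and activation memory (which depends on micro-batch size and the recomputation policy) is deliberately left out.

```python
# Assumed byte costs per parameter (mixed-precision Adam, as in the ZeRO paper).
BYTES_WEIGHTS = 2      # fp16 weights
BYTES_GRADS = 2        # fp16 gradients
BYTES_OPTIMIZER = 12   # fp32 master weights + Adam momentum + Adam variance


def static_bytes_per_device(params_on_device: int, dp_size: int, zero_stage: int) -> float:
    """Estimate static training memory (bytes) on one device.

    params_on_device: parameters left on this device after tensor/pipeline
        (and expert) partitioning. Hypothetical input, not a DeepSeek figure.
    dp_size: size of the data-parallel group that ZeRO shards across.
    zero_stage: 0 (no ZeRO), 1 (optimizer states), 2 (+ gradients),
        3 (+ weights).
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    if dp_size < 1:
        raise ValueError("dp_size must be at least 1")

    # Each ZeRO stage shards one more component across the data-parallel group.
    weights_div = dp_size if zero_stage >= 3 else 1
    grads_div = dp_size if zero_stage >= 2 else 1
    optim_div = dp_size if zero_stage >= 1 else 1

    return params_on_device * (
        BYTES_WEIGHTS / weights_div
        + BYTES_GRADS / grads_div
        + BYTES_OPTIMIZER / optim_div
    )


if __name__ == "__main__":
    # Illustrative values only: 1e9 parameters per device, 8-way data parallelism.
    for stage in range(4):
        gib = static_bytes_per_device(10**9, 8, stage) / 2**30
        print(f"ZeRO stage {stage}: {gib:.2f} GiB")
```

In a mixture-of-experts model the expert and non-expert parameters may be sharded over different data-parallel groups, so a fuller estimate would call this function once per parameter group and sum the results.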

