LG AI | May 5, 2025

Quantitative Analysis of Performance Drop in DeepSeek Model Quantization

Enbo Zhao, Yi Shen, Shuming Shi, Jieyun Huang, Zhihao Chen, Ning Wang, Siqi Xiao, Jian Zhang, Kai Wang, Shiguo Lian

arXiv:2505.02390v27.12 citations | h-index: 8 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of deploying large models such as DeepSeek-R1 and V3 locally for organizations with privacy or service-availability concerns. It is rated incremental because it builds on existing quantization techniques.

The study examines the performance drop of DeepSeek models after quantization for local deployment. It finds that 4-bit quantization causes little degradation while enabling single-machine deployment on standard GPUs. It also proposes a dynamic 3-bit method (DQ3_K_M) that outperforms the traditional Q3_K_M variant and is comparable to 4-bit quantization on most tasks.

There has recently been high demand for deploying DeepSeek-R1 and V3 locally, possibly because the official service is often busy and some organizations have data privacy concerns. While single-machine deployment offers infrastructure simplicity, the models' 671B-parameter FP8 configuration exceeds the practical memory limits of a standard 8-GPU machine. Quantization is a widely used technique for reducing model memory consumption. However, it is unclear how DeepSeek-R1 and V3 perform after quantization. This technical report presents the first quantitative evaluation of multi-bitwidth quantization across the complete DeepSeek model spectrum. Key findings reveal that 4-bit quantization incurs little performance degradation relative to FP8 while enabling single-machine deployment on standard NVIDIA GPU devices. We further propose DQ3_K_M, a dynamic 3-bit quantization method that significantly outperforms the traditional Q3_K_M variant on various benchmarks and is comparable to the 4-bit quantization approach (Q4_K_M) on most tasks. Moreover, DQ3_K_M supports single-machine deployment configurations for both NVIDIA H100/A100 and Huawei 910B. Our implementation of DQ3_K_M is released at https://github.com/UnicomAI/DeepSeek-Eval, containing optimized 3-bit quantized variants of both DeepSeek-R1 and DeepSeek-V3.
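
The memory argument in the abstract can be checked with rough arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes nominal bit widths of 8, 4, and 3 bits per weight (real quantized formats such as Q4_K_M and Q3_K_M store extra scale data, so their effective bit widths are somewhat higher), it assumes 80 GB of memory per GPU, and it ignores the KV cache, activations, and runtime overhead. The function names are hypothetical.

```python
# Rough weight-memory estimate for a 671B-parameter model.
# Assumptions (not from the paper): nominal bits per weight,
# 80 GB per GPU, and no allowance for KV cache or activations.

NUM_PARAMS = 671e9        # parameter count stated in the abstract
GPU_MEMORY_GB = 80.0      # assumed per-device memory
NUM_GPUS = 8              # "standard 8-GPU machine" from the abstract


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Return the memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def fits_on_machine(required_gb: float, gpu_memory_gb: float, num_gpus: int) -> bool:
    """Return True if the weights fit in the machine's total GPU memory."""
    return required_gb <= gpu_memory_gb * num_gpus


if __name__ == "__main__":
    for label, bits in [("FP8", 8), ("4-bit (nominal)", 4), ("3-bit (nominal)", 3)]:
        required = weight_memory_gb(NUM_PARAMS, bits)
        ok = fits_on_machine(required, GPU_MEMORY_GB, NUM_GPUS)
        print(f"{label}: {required:.1f} GB of weights, fits in "
              f"{NUM_GPUS} x {GPU_MEMORY_GB:.0f} GB: {ok}")
```

Under these assumptions, the FP8 weights alone (about 671 GB) exceed the 640 GB total of the machine, while the nominal 4-bit and 3-bit weights leave room for the KV cache and activations. Actual headroom depends on the quantization format, context length, and serving framework.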
