LG AI, Mar 4

Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization

Haodong Zhu, Yangyang Ren, Yanjing Li, Mingbao Lin, Linlin Yang, Xuhui Liu, Xiantong Zhen, Haiguang Liu, Baochang Zhang

arXiv:2603.04135v11.41 citations, h-index: 44

Originality: Highly original

AI Analysis

This paper offers a significant speedup for large language model (LLM) training, specifically for methods like GRPO that use group-based policy optimization, benefiting researchers and practitioners working on efficient LLM reasoning.

The paper tackles the computational cost of Group Relative Policy Optimization (GRPO) for LLM reasoning, which stems from its extensive group-based sampling. The authors propose Dynamic Pruning Policy Optimization (DPPO), which uses importance sampling to enable dynamic pruning while preserving unbiased gradient estimation. On Qwen3-4B trained on MATH, DPPO achieves a 2.37x training speedup and outperforms GRPO by 3.36% in average accuracy across six mathematical reasoning benchmarks.
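The general idea behind unbiased pruning can be illustrated with a minimal sketch. It is not the paper's derivation: it assumes the simplest form of importance-sampling correction, where each sample is kept independently with a known nonzero probability and kept samples are rescaled by the inverse of that probability. The function names and the keep probabilities are hypothetical and chosen only for illustration.

```python
import random


def full_mean(values):
    """Mean over the full, unpruned batch."""
    return sum(values) / len(values)


def pruned_mean(values, keep_probs, rng):
    """Estimate the full-batch mean from a randomly pruned batch.

    Item i is kept with probability keep_probs[i] (assumed > 0). A kept
    item is rescaled by 1 / keep_probs[i], so the expected value of this
    estimate equals full_mean(values) even though items are dropped.
    """
    total = 0.0
    for v, p in zip(values, keep_probs):
        if rng.random() < p:   # keep this item
            total += v / p     # inverse-probability rescaling
    return total / len(values)


def average_pruned_estimate(values, keep_probs, trials=20000, seed=0):
    """Average many pruned estimates to check unbiasedness empirically."""
    rng = random.Random(seed)
    runs = (pruned_mean(values, keep_probs, rng) for _ in range(trials))
    return sum(runs) / trials
```

Averaged over many random pruning masks, `average_pruned_estimate` should approach `full_mean`, whereas dropping items without the `1 / p` factor would systematically shrink the estimate. The rescaling factors used in DPPO itself may differ from this simple form.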

Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs due to its extensive group-based sampling requirement. While recent selective data utilization methods can mitigate this overhead, they could induce estimation bias by altering the underlying sampling distribution, compromising theoretical rigor and convergence behavior. To address this limitation, we propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation through importance sampling-based correction. By incorporating mathematically derived rescaling factors, DPPO significantly accelerates GRPO training without altering the optimization objective of the full-batch baseline. Furthermore, to mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy that maximizes valid token density and hardware utilization. Extensive experiments demonstrate that DPPO consistently accelerates training across diverse models and benchmarks. For instance, on Qwen3-4B trained on MATH, DPPO achieves 2.37$\times$ training speedup and outperforms GRPO by 3.36% in average accuracy across six mathematical reasoning benchmarks.
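The abstract describes Dense Prompt Packing only as a window-based greedy strategy, so the following is a rough sketch of what such a strategy could look like, not the authors' algorithm. It assumes a fixed token capacity per pack and a lookahead window over the pending sequences; `pack_greedy`, `token_density`, `capacity`, and `window` are hypothetical names.

```python
def pack_greedy(lengths, capacity, window=4):
    """Group sequence indices into packs of at most `capacity` tokens.

    For each pack, only the first `window` pending sequences are
    candidates. The longest candidate that still fits is added
    repeatedly until no candidate fits.
    """
    pending = list(range(len(lengths)))
    packs = []
    while pending:
        pack, used = [], 0
        while True:
            candidates = [i for i in pending[:window]
                          if used + lengths[i] <= capacity]
            if not candidates:
                break
            best = max(candidates, key=lambda i: lengths[i])
            pack.append(best)
            used += lengths[best]
            pending.remove(best)
        if not pack:
            # Nothing in the window fits an empty pack: the head sequence
            # exceeds capacity on its own, so it gets a pack to itself.
            pack.append(pending.pop(0))
        packs.append(pack)
    return packs


def token_density(packs, lengths, capacity):
    """Fraction of the allocated token slots that hold real tokens."""
    used = sum(lengths[i] for pack in packs for i in pack)
    return used / (capacity * len(packs))
```

A small window keeps the search cheap and roughly preserves the original ordering, while a larger window gives the greedy step more candidates for filling leftover capacity. How the paper balances this trade-off is not stated in the abstract.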

