CL · Feb 5

PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning

Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li, Jiansheng Wei, Xiaojun Meng, Min Zhang

arXiv:2602.05370v21.11 citations · h-index: 11

Originality: Incremental advance

AI Analysis

This paper addresses computational inefficiency and poor robustness in aligning large language models for mathematical reasoning. The advance is incremental because it builds on existing iterative alignment methods.

The paper tackles diminishing returns and policy collapse in iterative alignment for mathematical reasoning by challenging the scaling hypothesis of exploration. It introduces PACE, which outperforms DPO-R1 with $N=16$ while using only about $1/5$ of the compute.

Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-N sampling (e.g., $N \ge 8$) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and even catastrophic policy collapse. We theoretically demonstrate that scaling $N$ amplifies verifier noise and induces detrimental distribution shifts. To resolve this, we introduce PACE (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating with a minimal budget ($2<N<3$), PACE synthesizes high-fidelity preference pairs from failed explorations. Empirical evaluations show that PACE outperforms DPO-R1 $(N=16)$ while using only about $1/5$ of the compute, demonstrating superior robustness against reward hacking and label noise.
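
The claim that scaling $N$ amplifies verifier noise can be illustrated with a toy model. The sketch below is not the paper's analysis. It assumes independent samples, a per-sample correctness probability `p_correct`, a verifier with false-positive rate `fp_rate` and false-negative rate `fn_rate`, and a "chosen" trajectory drawn uniformly from the verifier-accepted samples. All function and parameter names are hypothetical, and only the "chosen" side of a preference pair is modeled. Under these assumptions, the probability of mining a wrong-but-accepted "chosen" trajectory for a hard prompt grows with $N$.

```python
def accept_rate(p_correct: float, fp_rate: float, fn_rate: float) -> float:
    """Probability that the (assumed) noisy verifier accepts one sampled solution."""
    true_accept = p_correct * (1.0 - fn_rate)   # correct and accepted
    false_accept = (1.0 - p_correct) * fp_rate  # wrong but accepted
    return true_accept + false_accept


def p_false_positive_chosen(n: int, p_correct: float,
                            fp_rate: float, fn_rate: float) -> float:
    """Probability that Best-of-N yields a 'chosen' sample that is actually wrong.

    Toy assumptions: n independent samples; the chosen sample is drawn
    uniformly at random from those the verifier accepts.
    """
    a = accept_rate(p_correct, fp_rate, fn_rate)
    if a == 0.0:
        return 0.0  # the verifier never accepts anything
    p_any_accepted = 1.0 - (1.0 - a) ** n           # at least one accepted sample
    share_wrong = (1.0 - p_correct) * fp_rate / a   # accepted samples that are wrong
    return p_any_accepted * share_wrong


if __name__ == "__main__":
    # A hard prompt: the policy is rarely right, the verifier is slightly noisy.
    for n in (2, 4, 8, 16):
        print(n, round(p_false_positive_chosen(n, 0.02, 0.05, 0.0), 3))
```

In this toy model the share of wrong samples among accepted ones does not depend on $N$. What grows with $N$ is the chance that a hard prompt contributes any accepted sample at all, and for hard prompts those samples are mostly false positives.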

