CLFeb 5

PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning

arXiv:2602.05370v21 citationsh-index: 11
AI Analysis

This addresses the issue of computational inefficiency and robustness in aligning large language models for mathematical reasoning, though it is incremental as it builds on existing iterative alignment methods.

The paper tackled the problem of diminishing returns and policy collapse in iterative alignment for mathematical reasoning by challenging the scaling hypothesis of exploration, and introduced PACE, which outperformed DPO-R1 with N=16 while using only about 1/5 of the compute.

Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-N sampling (e.g., $N \ge 8$) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and even catastrophic policy collapse. We theoretically demonstrate that scaling $N$ amplifies verifier noise and induces detrimental distribution shifts. To resolve this, we introduce \textbf{PACE} (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating with a minimal budget ($2<N<3$), PACE synthesizes high-fidelity preference pairs from failed explorations. Empirical evaluations show that PACE outperforms DPO-R1 $(N=16)$ while using only about $1/5$ of the compute, demonstrating superior robustness against reward hacking and label noise.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes