CL, Feb 20, 2025

Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning

arXiv:2502.14356v1, 15 citations, h-index: 32, ACL
Originality: Incremental advance
AI Analysis

The paper addresses a bottleneck in mathematical reasoning for language models, though the advance is incremental because it builds on existing DPO methods.

The paper targets a known weakness of Direct Preference Optimization (DPO): it struggles with long-chain mathematical reasoning. It proposes Full-Step-DPO, which draws step-wise rewards from the entire reasoning chain using a self-supervised process reward model, and reports superior performance on mathematical reasoning benchmarks.

Direct Preference Optimization (DPO) often struggles with long-chain mathematical reasoning. Existing approaches, such as Step-DPO, typically improve this by focusing on the first erroneous step in the reasoning chain. However, they overlook all other steps and rely heavily on humans or GPT-4 to identify erroneous steps. To address these issues, we propose Full-Step-DPO, a novel DPO framework tailored for mathematical reasoning. Instead of optimizing only the first erroneous step, it leverages step-wise rewards from the entire reasoning chain. This is achieved by training a self-supervised process reward model, which automatically scores each step, providing rewards without relying on external signals. Furthermore, we introduce a novel step-wise DPO loss, which dynamically updates gradients based on these step-wise rewards, giving language models stronger reasoning capabilities. Extensive evaluations on both in-domain and out-of-domain mathematical reasoning benchmarks, across various base language models, demonstrate that Full-Step-DPO achieves superior performance compared to state-of-the-art baselines.
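
The abstract does not give the step-wise loss in closed form, so the following is only an illustrative sketch of the general idea, not the paper's formula. It assumes that each reasoning step has a log-probability ratio (policy versus reference model) and a scalar reward from a process reward model, and that the rewards are turned into per-step weights with a softmax. The function names `softmax` and `stepwise_dpo_loss`, the weighting scheme, and the `beta` default are all hypothetical choices made for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def stepwise_dpo_loss(chosen_logratios, rejected_logratios,
                      chosen_rewards, rejected_rewards, beta=0.1):
    """Reward-weighted DPO-style loss for one preference pair (a sketch).

    *_logratios: per-step log pi_theta(step) - log pi_ref(step).
    *_rewards:   per-step scores from a process reward model.
    Assumption (not taken from the paper): chosen steps are weighted by
    softmax(reward), rejected steps by softmax(-reward), so that
    low-reward steps in the rejected chain receive the most weight.
    """
    w_chosen = softmax(chosen_rewards)
    w_rejected = softmax([-r for r in rejected_rewards])

    chosen_term = sum(w * lr for w, lr in zip(w_chosen, chosen_logratios))
    rejected_term = sum(w * lr for w, lr in zip(w_rejected, rejected_logratios))
    margin = beta * (chosen_term - rejected_term)

    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    x = -margin
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))

# Hypothetical numbers: a two-step chosen chain and a two-step rejected chain.
loss = stepwise_dpo_loss(
    chosen_logratios=[1.0, 1.0],
    rejected_logratios=[-1.0, -1.0],
    chosen_rewards=[0.9, 0.8],
    rejected_rewards=[0.7, 0.1],
)
```

With uniform weights and the per-step terms summed instead of averaged, this reduces to the standard DPO objective. The reward-derived weights are what let every step, rather than only the first erroneous one, shape the gradient.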
