LG · Oct 26, 2025

FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Yuyang Ding, Chi Zhang, Juntao Li, Haibin Lin, Xin Liu, Min Zhang

arXiv:2510.22543v1 · 4 citations · h-index: 12

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck in RL for LLM reasoning, offering an incremental improvement of interest to researchers and practitioners in AI and natural language processing.

The paper tackles flawed-positive rollouts in reinforcement learning for LLM reasoning: rollouts that reach the correct answer through unreliable reasoning yet receive the same reward as fully correct ones. It proposes FAPO, a flawed-aware policy optimization method that improves outcome correctness, process reliability, and training stability without increasing token usage.

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models (LLMs). In this context, models explore reasoning trajectories and exploit rollouts with correct answers as positive signals for policy optimization. However, these rollouts might involve flawed patterns such as answer-guessing and jump-in-reasoning. Such flawed-positive rollouts are rewarded identically to fully correct ones, causing policy models to internalize these unreliable reasoning patterns. In this work, we first conduct a systematic study of flawed-positive rollouts in RL and find that they enable rapid capability gains during the early optimization stage, while constraining reasoning capability later by reinforcing unreliable patterns. Building on these insights, we propose Flawed-Aware Policy Optimization (FAPO), which presents a parameter-free reward penalty for flawed-positive rollouts, enabling the policy to leverage them as useful shortcuts in the warm-up stage, securing stable early gains, while gradually shifting optimization toward reliable reasoning in the later refinement stage. To accurately and comprehensively detect flawed-positive rollouts, we introduce a generative reward model (GenRM) with a process-level reward that precisely localizes reasoning errors. Experiments show that FAPO is effective in broad domains, improving outcome correctness, process reliability, and training stability without increasing the token budget.
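The reward penalty described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact reward form, so everything here is an assumption made for illustration: the `Rollout` type, the `flawed` flag (standing in for a GenRM-style detector's verdict), the ±1 outcome reward, and the fixed penalty of 1 are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    answer_correct: bool  # outcome check from a verifier (assumed binary)
    flawed: bool          # hypothetical flag, e.g. a detector's process-level verdict


def shaped_reward(rollout: Rollout) -> float:
    """Outcome reward with a penalty on flawed-positive rollouts.

    Assumption: +1 for a correct answer, -1 otherwise, and a fixed
    penalty of 1 when the answer is correct but the reasoning is flagged.
    """
    base = 1.0 if rollout.answer_correct else -1.0
    if rollout.answer_correct and rollout.flawed:
        return base - 1.0  # flawed positive: ranked below fully correct
    return base


# Three rollouts for the same prompt: fully correct, flawed positive, incorrect.
rewards = [
    shaped_reward(Rollout(answer_correct=True, flawed=False)),
    shaped_reward(Rollout(answer_correct=True, flawed=True)),
    shaped_reward(Rollout(answer_correct=False, flawed=False)),
]
```

Under these assumptions a flawed-positive rollout still scores above an incorrect one, so it can act as a useful signal while fully correct rollouts are rare, but it scores below a fully correct rollout once those become common. That ordering is one way to read the shift from warm-up to refinement that the abstract describes.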

