LG · AI · CL · Apr 19

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

arXiv:2605.05226 · 90.5 · h-index: 2
Predicted impact: top 7% in LG · last 90 days · Originality: Highly original
AI Analysis

This work addresses the credit assignment problem in reinforcement learning for reasoning tasks, offering a scalable alternative to expensive process supervision.

The paper proposes a method for reinforcement learning in reasoning that internalizes outcome supervision into process supervision, enabling the model to automatically extract process-level learning signals from failed reasoning trajectories. The approach achieves finer-grained policy optimization under outcome-only supervision without costly external process supervision.

The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamentally in how to transform feedback provided only at the end of a sequence into fine-grained learning signals that can guide intermediate reasoning steps. Existing approaches either rely on outcome-level rewards for sequence-level optimization, which makes precise credit assignment difficult, or depend on externally constructed process supervision, which is costly and difficult to scale sustainably. To address this, we propose a new perspective: reinforcement learning for reasoning can be understood as the problem of internalizing outcome supervision into process supervision. From this perspective, we introduce a supervision-internalization method for reinforcement learning for reasoning, enabling the model to automatically extract process-level learning signals through identifying, correcting, and reusing failed reasoning trajectories, thereby achieving finer-grained policy optimization under outcome-only supervision. We further abstract this idea into a new training paradigm, in which the model continually generates and refines its own internal process supervision during reinforcement learning, opening a new path for fine-grained credit assignment in reinforcement learning for reasoning that differs from externally provided process supervision.
