AI
Oct 23, 2024

Process Supervision-Guided Policy Optimization for Code Generation

arXiv:2410.17621v2
27 citations
h-index: 7
Originality: Incremental advance
AI Analysis

The paper addresses a bottleneck in improving code generation for developers and AI systems: sparse reward signals during RL training. It is an incremental advance, since it builds on existing RL methods.

The paper tackles the problem of sparse rewards in reinforcement learning for code generation by introducing a Process Reward Model (PRM) that provides dense, line-level feedback during code generation, which significantly boosts performance, especially in long-horizon scenarios.

Reinforcement learning (RL) with unit test feedback has enhanced large language models' (LLMs) code generation, but relies on sparse rewards provided only after complete code evaluation, limiting learning efficiency and incremental improvements. When generated code fails all unit tests, no learning signal is received, hindering progress on complex tasks. To address this, we propose a Process Reward Model (PRM) that delivers dense, line-level feedback on code correctness during generation, mimicking human code refinement and providing immediate guidance. We explore various strategies for training PRMs and integrating them into the RL framework, finding that using PRMs both as dense rewards and for value function initialization significantly boosts performance. Our experimental results also highlight the effectiveness of PRMs in enhancing RL-driven code generation, especially for long-horizon scenarios.
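
To make the contrast between sparse and dense rewards concrete, here is a minimal sketch. It is not the paper's implementation. The function names, the `prm_weight` coefficient, and the assumption that the PRM emits one score in [-1, 1] per generated line are all hypothetical choices for illustration. The sketch shows that when every unit test fails, the sparse reward gives no signal at any line, while PRM scores still separate good lines from bad ones.

```python
from typing import List


def sparse_rewards(num_lines: int, unit_test_reward: float) -> List[float]:
    """Reward only at the final line, from the unit-test outcome."""
    rewards = [0.0] * num_lines
    if num_lines:
        rewards[-1] = unit_test_reward
    return rewards


def dense_rewards(
    line_scores: List[float],
    unit_test_reward: float,
    prm_weight: float = 0.5,  # hypothetical mixing coefficient
) -> List[float]:
    """Per-line PRM score (assumed in [-1, 1]) plus the final unit-test reward."""
    rewards = [prm_weight * score for score in line_scores]
    if rewards:
        rewards[-1] += unit_test_reward
    return rewards


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go at each line, accumulated from the last line backwards."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


if __name__ == "__main__":
    # Hypothetical PRM scores: first two lines look correct, last two do not.
    scores = [1.0, 1.0, -1.0, -1.0]
    failed = 0.0  # all unit tests failed
    print(sparse_rewards(len(scores), failed))
    print(dense_rewards(scores, failed))
    print(discounted_returns(dense_rewards(scores, failed)))
```

The abstract also reports using the PRM for value function initialization. That part is not shown here, because the summary gives no detail on how it is done.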

