SE | AI | CL | Feb 3, 2025

ACECODER: Acing Coder RL via Automated Test-Case Synthesis

arXiv:2502.01718v4 | 82 citations | h-index: 13 | ACL
Originality: Incremental advance
AI Analysis

This work aims to improve code generation models for developers and researchers. It is rated incremental because it builds on existing RL and test-case methods, applying them in a new domain.

The paper tackles the shortage of reliable reward data for reinforcement learning in code models by synthesizing test cases automatically at scale, then using them to train reward models and run RL. Reported gains include an average 10-point improvement for Llama-3.1-8B-Ins under best-of-32 sampling and an improvement of over 25% on HumanEval-plus after only 80 optimization steps of RL starting from Qwen2.5-Coder-base.

Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data and reward models in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. The resulting reward model yields an average 10-point improvement for Llama-3.1-8B-Ins and a 5-point improvement for Qwen2.5-Coder-7B-Ins through best-of-32 sampling, putting the 7B model on par with the 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow R1-style training, starting directly from Qwen2.5-Coder-base, and show that our RL training can improve the model on HumanEval-plus by over 25% and on MBPP-plus by 6% in merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.
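The reward-model steps described in the abstract (pass rates over sampled programs, preference pairs, Bradley-Terry loss, best-of-N selection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names and the thresholds `min_chosen` and `min_gap` are hypothetical choices made here for illustration, and the scores are plain floats standing in for the outputs of a learned reward model.

```python
import math
from itertools import combinations


def pass_rate(results):
    """Fraction of test cases a sampled program passed (results: list of bools)."""
    return sum(results) / len(results)


def build_preference_pairs(rates, min_chosen=0.8, min_gap=0.4):
    """Return (chosen_idx, rejected_idx) pairs for one question.

    min_chosen and min_gap are hypothetical filters (assumptions of this
    sketch): the chosen program must pass most tests and must beat the
    rejected program by a clear margin.
    """
    pairs = []
    for i, j in combinations(range(len(rates)), 2):
        hi, lo = (i, j) if rates[i] >= rates[j] else (j, i)
        if rates[hi] >= min_chosen and rates[hi] - rates[lo] >= min_gap:
            pairs.append((hi, lo))
    return pairs


def bradley_terry_loss(score_chosen, score_rejected):
    """Bradley-Terry loss for one pair: -log(sigmoid(s_chosen - s_rejected))."""
    diff = score_chosen - score_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def best_of_n(candidates, scores):
    """Best-of-N selection: return the candidate with the highest reward score."""
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_idx]
```

In a real training loop the two scores passed to `bradley_terry_loss` would come from the reward model being trained, and the loss would be averaged over all pairs; at inference time, `best_of_n` with 32 sampled programs corresponds to the best-of-32 setting mentioned in the abstract.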

