CL · Sep 26, 2025

Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

Chi Ruan, Dongfu Jiang, Yubo Wang, Wenhu Chen

arXiv:2509.22824v1 · 10.96 citations · h-index: 13

Originality: Incremental advance

AI Analysis

This work addresses the need for stronger reasoning and critique abilities in large language models for coding and general tasks. It is an incremental advance, since it builds on existing critique-based methods.

The paper introduces Critique Reinforcement Learning (CRL), which trains models to generate critiques of (question, solution) pairs, and shows that combining CRL with standard RL improves coder models. Critique-Coder-8B exceeds 60% on LiveCodeBench (v5), outperforming reasoning models such as DeepCoder-14B and GPT-o1.

Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms to explicitly foster critique or reflection. Several recent studies, such as Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD), have shown the benefits of explicitly teaching LLMs how to critique. Motivated by them, we propose Critique Reinforcement Learning (CRL), where the model is tasked with generating a critique for a given (question, solution) pair. The reward is determined solely by whether the final judgment label $c \in \{\texttt{True}, \texttt{False}\}$ of the generated critique aligns with the ground-truth judgment $c^*$. Building on this point, we introduce Critique-Coder, which is trained on a hybrid of RL and CRL by substituting 20% of the standard RL data with CRL data. We fine-tune multiple models (Critique-Coder) and evaluate them on different benchmarks to show their advantages over RL-only models. We show that Critique-Coder consistently outperforms RL-only baselines on all the evaluated benchmarks. Notably, our Critique-Coder-8B can reach over 60% on LiveCodeBench (v5), outperforming other reasoning models like DeepCoder-14B and GPT-o1. Beyond code generation, Critique-Coder also demonstrates enhanced general reasoning abilities, as evidenced by its better performance on logic reasoning tasks from the BBEH dataset. This indicates that the application of CRL on coding datasets enhances general reasoning and critique abilities, which are transferable across a broad range of tasks. Hence, we believe that CRL works as a great complement to standard RL for LLM reasoning.
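
The two mechanisms described in the abstract, a reward that depends only on whether the critique's final judgment matches the ground truth, and a training set in which 20% of the RL examples are swapped for CRL examples, can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the "Judgment: True/False" output format, the 1.0/0.0 reward values, and the names extract_judgment, crl_reward, and mix_datasets are all assumptions made for the example.

```python
import random
import re
from typing import Optional, Sequence

# ASSUMPTION: the critique ends with a line such as "Judgment: True".
# The abstract does not specify how the label c is written out.
_LABEL_RE = re.compile(r"judgment\s*:\s*(true|false)", re.IGNORECASE)


def extract_judgment(critique: str) -> Optional[bool]:
    """Return the last True/False judgment found in the critique, or None."""
    matches = _LABEL_RE.findall(critique)
    if not matches:
        return None
    return matches[-1].lower() == "true"


def crl_reward(critique: str, gold_judgment: bool) -> float:
    """Reward depends only on whether the predicted label c equals c*.

    ASSUMPTION: 1.0 for a match, 0.0 otherwise (including a missing label).
    """
    predicted = extract_judgment(critique)
    return 1.0 if predicted is not None and predicted == gold_judgment else 0.0


def mix_datasets(rl_data: Sequence, crl_data: Sequence,
                 crl_fraction: float = 0.2, seed: int = 0) -> list:
    """Replace a fraction of the RL examples with CRL examples.

    The total size stays equal to len(rl_data); crl_fraction=0.2 mirrors
    the 20% substitution mentioned in the abstract.
    """
    rng = random.Random(seed)
    n_total = len(rl_data)
    n_crl = round(n_total * crl_fraction)
    if n_crl > len(crl_data):
        raise ValueError("not enough CRL examples for the requested fraction")
    kept_rl = rng.sample(list(rl_data), n_total - n_crl)
    picked_crl = rng.sample(list(crl_data), n_crl)
    mixed = kept_rl + picked_crl
    rng.shuffle(mixed)
    return mixed
```

Because the reward looks only at the final label, the text of the critique is never scored directly; a critique earns reward only by reaching the correct verdict on the given solution.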

