CL, AI | Jun 3, 2025

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

arXiv:2506.03106v5 | 71 citations | h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses the problem of enhancing complex reasoning in LLMs, which is relevant to both researchers and practitioners. It appears incremental because it builds on existing RL methods by integrating additional feedback.

The paper tackles performance plateaus and persistent failures in RL-finetuned LLMs by proposing Critique-GRPO, an online RL framework that integrates natural language and numerical feedback, resulting in average pass@1 score improvements of +4.4% on Qwen2.5-7B-Base and +3.8% on Qwen3-8B across eight reasoning tasks.
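To make the reported metric concrete, here is a minimal sketch of how an average pass@1 score and its improvement could be computed. It uses the standard unbiased pass@k estimator and a plain macro-average over tasks. The function names and sample counts are hypothetical and do not come from the paper. Treating the gain as an absolute difference between averages is also an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n sampled answers, c of them correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct answer.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def average_pass_at_1(results):
    """Macro-average pass@1 over tasks; results holds (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical counts for two tasks (illustrative only, not the paper's data).
baseline = [(4, 1), (4, 2)]
improved = [(4, 2), (4, 3)]

# Assumption: the gain is the absolute difference between the two averages.
gain = average_pass_at_1(improved) - average_pass_at_1(baseline)
print(f"average pass@1 gain: {gain:.3f}")
```

For k = 1 the estimator reduces to the fraction of correct samples, so pass@1 is simply per-problem accuracy averaged over the benchmark.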

Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks. Specifically, Critique-GRPO improves average pass@1 scores across all compared methods by approximately +4.4% on Qwen2.5-7B-Base and +3.8% on Qwen3-8B. Notably, Critique-GRPO enables effective self-improvement through self-critiquing, achieving significant gains over GRPO, e.g., +16.7% pass@1 improvement on AIME 2024.
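The abstract says the policy learns from initial responses and critique-guided refinements at the same time. One way to picture this is a GRPO-style group-relative advantage computed over a group that mixes both kinds of samples. The sketch below is illustrative and is not the authors' implementation. Pooling both kinds into one normalization group, the population standard deviation, the `eps` term, and all function names are assumptions. The paper's shaping function is not modeled, since its form is not given here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


def mixed_group_advantages(initial_rewards, refinement_rewards):
    """Normalize initial responses and refinements together (assumed pooling)."""
    rewards = list(initial_rewards) + list(refinement_rewards)
    advantages = group_relative_advantages(rewards)
    n_initial = len(initial_rewards)
    return advantages[:n_initial], advantages[n_initial:]


# Hypothetical binary rewards: three failed samples, one correct sample,
# and one correct critique-guided refinement.
init_adv, ref_adv = mixed_group_advantages([0.0, 0.0, 0.0, 1.0], [1.0])
print(init_adv, ref_adv)
```

Under this assumed pooling, a correct refinement receives a positive advantage even when most initial responses fail. This gives the policy a learning signal on problems where scalar rewards alone would be nearly uniform.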

