AI | Sep 7, 2025

Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL

Haoyang He, Zihua Rong, Kun Ji, Chenyang Li, Qing Huang, Chong Xia, Lan Yang, Honggang Zhang

arXiv:2509.06024v1 | 7.83 citations | h-index: 4 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of enhancing formal reasoning skills in large language models for tasks such as logical deduction and mathematics. It represents an incremental improvement over existing RL methods.

The authors propose Dynamic Reasoning Efficiency Reward (DRER), a reinforcement learning framework that improves Chain-of-Thought reasoning through fine-grained rewards and a dynamic length advantage. With it, a 7B model reaches GPT-o3-mini level performance on the new Logictree benchmark, and the average confidence of CoT-augmented answers rises by 30%.

Reinforcement learning (RL) has recently become the dominant paradigm for strengthening the reasoning abilities of large language models (LLMs). Yet the rule-based reward functions commonly used on mathematical or programming benchmarks assess only answer format and correctness, providing no signal as to whether the induced Chain-of-Thought (CoT) actually improves the answer. Furthermore, such task-specific training offers limited control over logical depth and therefore may fail to reveal a model's genuine reasoning capacity. We propose Dynamic Reasoning Efficiency Reward (DRER), a plug-and-play RL reward framework that reshapes both reward and advantage signals. (i) A Reasoning Quality Reward assigns fine-grained credit to those reasoning chains that demonstrably raise the likelihood of the correct answer, directly incentivising the trajectories with beneficial CoT tokens. (ii) A Dynamic Length Advantage decays the advantage of responses whose length deviates from a validation-derived threshold, stabilising training. To facilitate rigorous assessment, we also release Logictree, a dynamically constructed deductive reasoning dataset that functions both as RL training data and as a comprehensive benchmark. Experiments confirm the effectiveness of DRER: our 7B model attains GPT-o3-mini level performance on Logictree in 400 training steps, while the average confidence of CoT-augmented answers rises by 30%. The model further exhibits generalisation across diverse logical-reasoning datasets and the mathematical benchmark AIME24. These results illuminate how RL shapes CoT behaviour and chart a practical path toward enhancing formal-reasoning skills in large language models. All code and data are available in the repository at https://github.com/Henryhe09/DRER.
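The two DRER components described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the function names, the use of a log-likelihood gap squashed by tanh as the quality signal, and the exponential decay with its `decay` and `tolerance` parameters are all hypothetical and may differ from what the authors implement in their repository.

```python
import math


def reasoning_quality_reward(logp_with_cot: float,
                             logp_without_cot: float,
                             scale: float = 1.0) -> float:
    """Hypothetical quality reward (not the authors' formula).

    Rewards a reasoning chain by how much it raises the log-likelihood
    of the correct answer compared with answering without the chain.
    tanh keeps the reward bounded in (-scale, scale).
    """
    gain = logp_with_cot - logp_without_cot
    return scale * math.tanh(gain)


def dynamic_length_advantage(advantage: float,
                             length: int,
                             threshold: int,
                             tolerance: int = 0,
                             decay: float = 0.01) -> float:
    """Hypothetical length-aware advantage (not the authors' formula).

    Shrinks the advantage toward zero as the response length moves away
    from a threshold, which the abstract says is derived from validation
    data. Lengths within `tolerance` tokens of the threshold are not
    penalised.
    """
    deviation = max(0, abs(length - threshold) - tolerance)
    return advantage * math.exp(-decay * deviation)


# A chain that makes the correct answer more likely earns a positive reward.
helpful = reasoning_quality_reward(logp_with_cot=-0.5, logp_without_cot=-2.0)
# A chain that makes the correct answer less likely earns a negative reward.
harmful = reasoning_quality_reward(logp_with_cot=-2.0, logp_without_cot=-0.5)

# The advantage is untouched at the threshold and decays as length deviates.
on_target = dynamic_length_advantage(1.0, length=500, threshold=500)
far_off = dynamic_length_advantage(1.0, length=900, threshold=500)
```

In this sketch the decay is multiplicative, so both positive and negative advantages are pulled toward zero for off-length responses, which is one plausible reading of "decays the advantage"; the actual mechanism should be checked against the released code.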
