AI · Aug 29, 2025

Know When to Explore: Difficulty-Aware Certainty as a Guide for LLM Reinforcement Learning

Ang Li, Zhihang Yuan, Yang Zhang, Shouda Liu, Yisen Wang

arXiv:2509.00125v118.110 citations · h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses a key bottleneck in improving reasoning abilities for LLMs, offering a domain-specific incremental advance.

The paper tackles the problem of sparse rewards in reinforcement learning for large language models by introducing DACE, an algorithm that uses difficulty-aware certainty to balance exploration and exploitation, achieving higher accuracy on mathematical reasoning benchmarks like AIME and MATH.

Reinforcement Learning with Verifiable Feedback (RLVF) has become a key technique for enhancing the reasoning abilities of Large Language Models (LLMs). However, its reliance on sparse, outcome-based rewards, which only indicate whether a final answer is correct, fails to provide granular guidance on the reasoning process itself. This limitation hinders efficient learning, as the model cannot distinguish between high-quality and inefficient solutions, nor can it learn effectively from different types of failures. To address this, we observe that an LLM's self-certainty often correlates with task difficulty and solution quality. We introduce Difficulty-Aware Certainty-guided Exploration (DACE), a novel RL algorithm that leverages this insight to dynamically balance the exploration-exploitation trade-off. DACE assesses task difficulty online based on the policy's success rate. It then uses this signal to modulate an intrinsic reward: for difficult tasks where the model is struggling, DACE encourages exploration by penalizing high certainty; for easier tasks, it encourages learning efficiency by rewarding high certainty. Experiments on challenging mathematical reasoning benchmarks (AIME, MATH) show that DACE significantly outperforms strong baselines. The DACE-trained models not only achieve higher accuracy but also demonstrate more robust performance when scaling test-time compute, validating that our adaptive approach fosters effective exploration without sacrificing precision.
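The reward modulation described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's formulation: the abstract does not give the exact difficulty estimate, the certainty measure, or how the intrinsic term is combined with the outcome reward. The function name `dace_style_rewards`, the hard threshold `difficulty_threshold`, the coefficient `alpha`, and the additive combination are all assumptions made for illustration. Difficulty is taken here as one minus the success rate over a group of sampled responses to the same prompt.

```python
from typing import List


def dace_style_rewards(
    correct: List[bool],
    certainty: List[float],
    difficulty_threshold: float = 0.5,  # assumed cut-off, not from the paper
    alpha: float = 0.1,                 # assumed intrinsic-reward weight
) -> List[float]:
    """Combine a sparse outcome reward with a certainty-based intrinsic term.

    correct[i]   : whether sampled response i reached the right final answer
    certainty[i] : a self-certainty score for response i (higher = more certain)
    """
    if len(correct) != len(certainty):
        raise ValueError("correct and certainty must have the same length")
    if not correct:
        return []

    # Online difficulty estimate from the policy's success rate on this prompt.
    success_rate = sum(correct) / len(correct)
    difficulty = 1.0 - success_rate

    # Hard prompt: penalize high certainty to encourage exploration.
    # Easy prompt: reward high certainty to encourage efficient solutions.
    sign = -1.0 if difficulty > difficulty_threshold else 1.0

    return [float(c) + sign * alpha * s for c, s in zip(correct, certainty)]


if __name__ == "__main__":
    scores = [0.9, 0.2, 0.5, 0.4]
    print(dace_style_rewards([False, False, False, True], scores))  # hard prompt
    print(dace_style_rewards([True, True, True, False], scores))    # easy prompt
```

With the same certainty scores, the first call lowers the reward of the most certain responses because only one of four samples succeeded, while the second call raises it because three of four succeeded. A continuous weighting in place of the hard threshold would be an equally plausible reading of the abstract.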
