LG · Apr 8, 2025

Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization

Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, Yatao Bian

arXiv:2504.05812v344.3113 citations · h-index: 9 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need to improve LLM reasoning without supervision. The approach is an incremental advance, but it reduces dependence on labeled data.

The paper tackles the problem of enhancing reasoning in large language models without external supervision. It proposes a fully unsupervised method that improves accuracy on mathematical and free-form natural reasoning benchmarks, e.g., raising Qwen2.5-Math-7B Base from 30.7% to 48.1% on mathematical benchmarks, a gain of 17.4 percentage points.

Existing methods to enhance the reasoning capability of large language models predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data. These approaches critically depend on external supervision, such as labeled reasoning traces, verified golden answers, or pre-trained reward models. In this work, we propose Entropy Minimized Policy Optimization (EMPO), which makes an early attempt at fully unsupervised LLM reasoning incentivization. By continuously minimizing the predictive entropy of LLMs on unlabeled questions in a latent semantic space, EMPO achieves competitive performance compared to supervised counterparts on both mathematical and free-form natural reasoning tasks. Specifically, without any supervised signals, EMPO boosts the accuracy of Qwen2.5-Math-7B Base from 30.7% to 48.1% on mathematical benchmarks and improves the accuracy of Qwen2.5-7B Base from 32.1% to 50.1% on MMLU-Pro. Primary experiments and analysis are also provided to interpret the effectiveness of EMPO. Code is available at https://github.com/QingyangZhang/EMPO.
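To make "predictive entropy in a latent semantic space" concrete, the following minimal sketch estimates entropy over groups of sampled answers that share a meaning, rather than over raw token sequences. It is an illustration under stated assumptions, not the authors' implementation: the `normalize` helper is a hypothetical stand-in for whatever semantic clustering the method uses, and the function names are invented for this example.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    # Hypothetical stand-in for semantic clustering (an assumption, not the
    # paper's procedure): answers that match after trimming whitespace and
    # lowercasing are treated as having the same meaning.
    return answer.strip().lower()


def semantic_entropy(answers: list[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over meaning clusters."""
    if not answers:
        return 0.0
    counts = Counter(normalize(a) for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Answers sampled for one unlabeled question (made-up strings for illustration).
consistent = ["42", " 42", "42 "]          # all samples fall into one cluster
scattered = ["42", "41", "7", "forty"]     # every sample is its own cluster

# Agreement among samples yields lower entropy than disagreement.
print(semantic_entropy(consistent) < semantic_entropy(scattered))
```

In this sketch, a model whose sampled answers agree in meaning has low entropy, so an objective that lowers this quantity rewards self-consistency and needs no reference answer.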
