AI · Oct 29, 2025

Zero Reinforcement Learning Towards General Domains

Yuyuan Zeng, Yufei Huang, Can Xu, Qingfeng Sun, Jianfeng Yan, Guanghui Xu, Tao Yang, Fengzong Lian

arXiv:2510.25528v1 · 3 citations · h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses a gap in zero-RL for more diverse, real-world scenarios where reward verification is difficult, offering a method to improve reasoning in general domains, though it appears incremental as it builds on existing zero-RL techniques.

The paper tackles the challenge of applying zero reinforcement learning (zero-RL) to domains without easily verifiable rewards. It proposes a paradigm that combines verifiable rewards with a generative reward model and adds a smooth length penalty, aiming to improve reasoning on both verifiable and non-verifiable tasks. The authors report superior reasoning performance in experiments on Qwen3-8B-Base and Qwen3-14B-Base.

Zero Reinforcement Learning (Zero-RL) has proven to be an effective approach for enhancing the reasoning capabilities of large language models (LLMs) by directly applying reinforcement learning with verifiable rewards on pretrained models, without the need for a supervised fine-tuning phase. However, current research on zero-RL primarily focuses on domains with easily verifiable reward signals, such as mathematics, programming, and other reasoning tasks. The challenge of eliciting reasoning abilities in more diverse scenarios, where verification is not straightforward, remains underexplored. To address this gap, we propose a novel zero-RL paradigm designed to improve a model's reasoning ability across both verifiable and non-verifiable domains. By combining verifiable rewards with a generative reward model, we conduct multi-task zero-RL training across both domains, facilitating the transfer of reasoning capabilities between them. Furthermore, to mitigate reward hacking in the generative reward model, we design a smooth length penalty that encourages the generation of more comprehensive thinking tokens in general domains. Experimental results on Qwen3-8B-Base and Qwen3-14B-Base demonstrate that our approach achieves superior reasoning performance, not only on tasks requiring extensive reasoning but also on more general tasks.
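The abstract does not give formulas, so the following is only a minimal sketch of how a mixed reward of this kind might be wired together. Everything in it is an assumption made for illustration, not the authors' implementation: the function names, the exact-match verifier, the half-cosine shape of the penalty, the `target_tokens` threshold, and the choice to apply the penalty only to responses scored by the generative reward model (represented here by a hypothetical `judge` callable returning a score in [0, 1]).

```python
import math
from typing import Callable, Optional


def smooth_length_penalty(num_thinking_tokens: int, target_tokens: int = 512) -> float:
    """Hypothetical smooth factor in [0, 1]; 1.0 means no penalty.

    A half-cosine ramp rises smoothly from 0 (no thinking tokens) to 1
    (at `target_tokens`) and stays at 1 beyond that, so very short
    reasoning traces are discounted without a hard cutoff.
    """
    if target_tokens <= 0:
        raise ValueError("target_tokens must be positive")
    ratio = min(max(num_thinking_tokens, 0) / target_tokens, 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * ratio))


def combined_reward(
    response: str,
    num_thinking_tokens: int,
    reference: Optional[str] = None,
    judge: Optional[Callable[[str], float]] = None,
    target_tokens: int = 512,
) -> float:
    """Route a response to a verifiable or a model-based reward.

    Assumption: a task with a `reference` answer is "verifiable" and is
    scored by exact match; any other task is scored by `judge`, a
    stand-in for a generative reward model, and that score is scaled
    by the length factor.
    """
    if reference is not None:
        # Verifiable branch: rule-based check, no length shaping.
        return 1.0 if response.strip() == reference.strip() else 0.0
    if judge is None:
        raise ValueError("non-verifiable tasks need a judge callable")
    # Non-verifiable branch: clamp the judge score, then apply the penalty.
    score = min(max(judge(response), 0.0), 1.0)
    return score * smooth_length_penalty(num_thinking_tokens, target_tokens)
```

Under these assumptions, a math answer would be scored with `combined_reward(answer, n_tokens, reference="42")`, while an open-ended response would be scored with `combined_reward(answer, n_tokens, judge=my_reward_model)`, where `my_reward_model` is a placeholder for whatever scoring model is available. Scaling only the model-based score reflects the abstract's statement that the penalty targets reward hacking in the generative reward model; the paper itself may define the penalty differently.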

