Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
This work addresses how language models can self-improve without external labels, a requirement for real-world deployment; its contribution is an incremental advance over existing self-improvement methods.
The paper observes that label-free self-improvement often drives language models toward over-confident solutions and diversity collapse, and proposes EVOL-RL, a framework that balances majority-based selection with novelty rewards. Relative to the majority-only baseline, label-free training on AIME24 raises Qwen3-4B-Base's AIME25 pass@1 from 4.6% to 16.4% and pass@16 from 18.5% to 37.9%, and it also improves out-of-domain generalization.
Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing self-improvement approaches primarily rely on self-confirmation signals (e.g., confidence, entropy, or consistency) to generate rewards. This reliance drives models toward over-confident, majority-favored solutions, causing an entropy collapse that degrades pass@n and reasoning complexity. To address this, we propose EVOL-RL, a label-free framework that mirrors the evolutionary principle of balancing selection with variation. Concretely, EVOL-RL retains the majority-voted answer as an anchor for stability, but adds a novelty-aware reward that scores each sampled solution by how different its reasoning is from other concurrently generated responses. This majority-for-stability + novelty-for-exploration rule mirrors the variation-selection principle: selection prevents drift, while novelty prevents collapse. Evaluation results show that EVOL-RL consistently outperforms the majority-only baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base's AIME25 pass@1 from the baseline's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents in-domain diversity collapse but also improves out-of-domain generalization (from math reasoning to broader tasks, e.g., GPQA, MMLU-Pro, and BBEH). The code is available at: https://github.com/YujunZhou/EVOL-RL.
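The majority-plus-novelty rule described above can be illustrated with a minimal Python sketch. It is not the released EVOL-RL implementation. The abstract does not specify how novelty is measured or how rewards are scaled, so the following are assumptions made only for illustration: novelty is taken as one minus the mean cosine similarity between a response's reasoning embedding and the embeddings of the other responses sampled for the same prompt; majority-matching responses receive rewards in [0.5, 1]; all other responses receive rewards in [-1, -0.5]. The function names `cosine` and `majority_novelty_rewards` are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def majority_novelty_rewards(answers, embeddings):
    """Return (majority_answer, rewards) for one prompt's sampled responses.

    answers:    final answer extracted from each response
    embeddings: one vector per response representing its reasoning
                (how these vectors are produced is outside this sketch)
    """
    # Selection: the majority-voted answer acts as the label-free anchor.
    majority, _ = Counter(answers).most_common(1)[0]

    # Variation: novelty = 1 - mean similarity to the other responses (assumed measure).
    n = len(answers)
    novelty = []
    for i in range(n):
        sims = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        novelty.append(1.0 - sum(sims) / len(sims) if sims else 0.0)

    # Rescale novelty to [0, 1] within the group so it only reorders responses
    # inside a band and never flips the sign set by the majority vote.
    lo, hi = min(novelty), max(novelty)
    span = hi - lo

    rewards = []
    for ans, nov in zip(answers, novelty):
        scaled = (nov - lo) / span if span > 0 else 0.5
        if ans == majority:
            rewards.append(0.5 + 0.5 * scaled)   # assumed band [0.5, 1]
        else:
            rewards.append(-1.0 + 0.5 * scaled)  # assumed band [-1, -0.5]
    return majority, rewards


if __name__ == "__main__":
    answers = ["42", "42", "42", "7"]
    embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    majority, rewards = majority_novelty_rewards(answers, embeddings)
    print(majority, [round(r, 3) for r in rewards])
```

In this toy group, the third response agrees with the majority but reasons differently from the other two majority responses, so it gets the largest reward. The dissenting fourth response stays negative regardless of its novelty. Keeping the two bands disjoint is one way to let novelty encourage exploration without overriding the stabilizing majority signal.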