Beyond Verifiable Rewards: Scaling Reinforcement Learning for Language Models to Unverifiable Data
This work addresses the challenge of training language models with RL on unverifiable data. The contribution is incremental: it builds on prior RL methods but extends them to less structured domains.
The paper tackles the problem of scaling reinforcement learning for language models to unverifiable data, such as long-form answers like mathematical proofs, by proposing JEPO, a novel algorithm that applies Jensen's evidence lower bound. Results show JEPO is as effective as RL with verifiable rewards on math data, improves on semi-verifiable data, and outperforms baselines on unverifiable data in likelihood evaluations.
We propose to scale RL to unverifiable data with a novel algorithm, JEPO (Jensen's Evidence lower bound Policy Optimization). While most prior efforts on scaling RL for LLMs focus on verifiable data, where ground-truth answers are typically short-form and can be matched easily, we investigate the case where such assumptions are less valid (e.g., when answers are long-form, such as mathematical proofs). To scale RL training to unverifiable data under contemporary training constraints, JEPO applies Jensen's evidence lower bound, a pragmatic simplification of the evidence lower bound that views chain-of-thought as a latent variable in the generative process. We show that on verifiable data (math), JEPO is as effective as RL with verifiable rewards; on semi-verifiable data (numina), JEPO improves on soft-match based evaluations compared to RL with verifiable rewards, which can only leverage a subset of the data source; finally, on unverifiable data (numina-proof), JEPO outperforms SFT and a few ablation baselines on likelihood evaluations.
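The bound the abstract refers to can be illustrated with a small numerical sketch. Treating the chain-of-thought c as a latent variable, the log-likelihood of an answer a given a prompt x is log p(a|x) = log E_c[p(a|x,c)], and Jensen's inequality gives the lower bound E_c[log p(a|x,c)]. The toy below checks this inequality on a discrete distribution over three hypothetical chains of thought. The function names and all probabilities are made up for illustration; this is not the paper's training objective or implementation, only the inequality it builds on.

```python
import math

def exact_log_marginal(cot_probs, answer_probs):
    """log p(a|x) = log sum_c p(c|x) * p(a|x,c), with c marginalized out exactly."""
    return math.log(sum(pc * pa for pc, pa in zip(cot_probs, answer_probs)))

def jensen_lower_bound(cot_probs, answer_probs):
    """E_{c ~ p(c|x)}[log p(a|x,c)], which is <= log p(a|x) by Jensen's inequality."""
    return sum(pc * math.log(pa) for pc, pa in zip(cot_probs, answer_probs))

# Hypothetical toy numbers (assumptions, not from the paper):
# p(c|x) for three candidate chains of thought, and p(a|x,c) for each of them.
cot_probs = [0.5, 0.3, 0.2]
answer_probs = [0.9, 0.4, 0.05]

log_marginal = exact_log_marginal(cot_probs, answer_probs)
bound = jensen_lower_bound(cot_probs, answer_probs)
gap = log_marginal - bound  # non-negative; zero only if p(a|x,c) is the same for every c

print(f"log p(a|x)   = {log_marginal:.4f}")
print(f"Jensen bound = {bound:.4f}")
print(f"gap          = {gap:.4f}")
```

The gap shrinks as the chains of thought agree more closely on how likely the answer is, which is one way to read why maximizing the bound pushes probability toward chains of thought that make the reference answer likely. In practice the expectation would be estimated from sampled chains of thought rather than enumerated as it is here.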