AI | Feb 5

Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning

Zhuoming Chen, Hongyi Liu, Yang Zhou, Haizhong Zheng, Beidi Chen

arXiv:2602.06107v1 | 2 citations | h-index: 6

Originality: Highly original

AI Analysis

This work tackles the problem of expensive RL for LLMs, offering a more efficient training approach for researchers and practitioners working with large models.

This paper addresses the high cost of reinforcement learning (RL) for large language models (LLMs) by proposing Jackpot, a framework that decouples rollout generation from policy optimization using Optimal Budgeted Rejection Sampling (OBRS). Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps with a batch size of 64.

Reinforcement learning (RL) for large language models (LLMs) remains expensive, largely because rollout generation is costly. Decoupling rollout generation from policy optimization (e.g., using a more efficient model to generate rollouts) could enable substantial efficiency gains, yet doing so introduces a severe distribution mismatch that destabilizes learning. We propose Jackpot, a framework that leverages Optimal Budgeted Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-$k$ probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps with a batch size of 64. Taken together, our results show that OBRS-based alignment brings us a step closer to practical and effective decoupling of rollout generation from policy optimization in RL for LLMs.
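
The claim that budgeted rejection sampling moves the rollout distribution toward the target can be illustrated on a toy discrete example. The sketch below is not the paper's implementation: it assumes an acceptance rule of the form min(1, p(x) / (c * q(x))), with the scale c chosen by bisection so that the overall acceptance rate matches a given budget. The function names, the four-outcome distributions, and the budget of 0.5 are all hypothetical choices made for illustration.

```python
# Toy illustration of budgeted rejection sampling on a discrete space.
# Assumption (not taken from the abstract): a sample x drawn from the
# proposal q is accepted with probability min(1, p(x) / (c * q(x))),
# and c is tuned so that the average acceptance rate equals the budget.

def acceptance_probs(p, q, c):
    """Per-outcome acceptance probability min(1, p_i / (c * q_i))."""
    return [min(1.0, pi / (c * qi)) for pi, qi in zip(p, q)]

def acceptance_rate(p, q, c):
    """Expected acceptance rate under the proposal q."""
    return sum(qi * ai for qi, ai in zip(q, acceptance_probs(p, q, c)))

def solve_scale(p, q, budget, iters=100):
    """Bisect for c; the acceptance rate is non-increasing in c."""
    lo = 1e-9
    hi = max(max(pi / qi for pi, qi in zip(p, q)), 1.0 / budget)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if acceptance_rate(p, q, mid) > budget:
            lo = mid  # accepting too often: raise c
        else:
            hi = mid  # accepting too rarely: lower c
    return 0.5 * (lo + hi)

def accepted_distribution(p, q, c):
    """Distribution of samples that survive the rejection step."""
    mass = [qi * ai for qi, ai in zip(q, acceptance_probs(p, q, c))]
    total = sum(mass)
    return [m / total for m in mass]

def total_variation(a, b):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical target (policy) and proposal (rollout model) distributions.
target = [0.5, 0.3, 0.15, 0.05]
proposal = [0.1, 0.2, 0.3, 0.4]
budget = 0.5  # keep about half of the proposed samples

scale = solve_scale(target, proposal, budget)
rate = acceptance_rate(target, proposal, scale)
filtered = accepted_distribution(target, proposal, scale)
tv_before = total_variation(target, proposal)
tv_after = total_variation(target, filtered)
print(f"rate={rate:.3f} tv_before={tv_before:.3f} tv_after={tv_after:.3f}")
```

In this toy setting, keeping about half of the proposals reduces the total variation distance to the target from 0.5 to roughly 0.3. A smaller budget rejects more samples and moves the accepted distribution closer still, which is the trade-off a controllable acceptance budget exposes.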
