AIS: Adaptive Importance Sampling for Quantized RL

Jiajun Zhou, Wei Shao, Lingchao Zheng, Yuwei Fan, Ngai Wong

arXiv:2605.1390775.3

Predicted impact top 4% in ML · last 90 daysOriginality Incremental advance

AI Analysis

For practitioners deploying quantized RL for LLMs, this solves a practical bottleneck (rollout-training mismatch) that previously caused training collapse, enabling efficient low-precision training without performance loss.

The paper addresses the rollout-training mismatch in quantized RL for LLMs, where FP8 rollouts with BF16 training cause non-stationary bias that destabilizes training. AIS adaptively corrects this bias per batch, matching BF16 performance while retaining 1.5-2.76x rollout speedup.

Reinforcement learning (RL) for large language models (LLMs) is dominated by the cost of rollout generation, which has motivated the use of low-precision rollouts (e.g., FP8) paired with a BF16 trainer to improve throughput and reduce memory pressure. This introduces a rollout-training mismatch that biases the policy gradient and can cause training to collapse outright on reasoning benchmarks. We show that the mismatch is non-stationary and acts as a double-edged sword: early in training it provides a stochastic exploration bonus, exposing the gradient to trajectories the trainer would otherwise under-sample, but the same perturbation transitions into a destabilizing source of bias as the policy concentrates. To solve this, we propose Adaptive Importance Sampling (AIS), a correction framework that adjusts the strength of its intervention on a per-batch basis. AIS combines three real-time diagnostics, namely weight reliability, divergence severity, and variance amplification, into a single mixing coefficient that interpolates between the uncorrected and fully importance-weighted gradients, suppressing the destabilizing component of the mismatch while preserving its exploratory benefit. We integrate AIS into GRPO and evaluate it on the diffusion-based LLaDA-8B-Instruct and the autoregressive Qwen3-8B and Qwen3.5-9B across mathematical reasoning and planning benchmarks. AIS matches the BF16 baseline on most tasks while retaining the 1.5 to 2.76x rollout speedup of FP8.

View on arXiv PDF

Similar