LG · Jun 4

SALT: When More Rollouts Don't Help in Group-Based Policy Optimization and How to Make Them Matter

arXiv:2606.05800 · 70.3
Predicted impact: top 37% in LG · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners using group-based policy optimization in RL with verifiable rewards, SALT provides a plug-in fix to a fundamental cancellation issue without changing reward models or sampling.

GRPO-style group-relative updates in RLVR suffer from signed cancellation that weakens learning as rollouts increase. SALT reweights group-relative coefficients using gradient geometry, improving effective updates and performance across reasoning benchmarks.

Reinforcement learning with verifiable rewards (RLVR) often adopts GRPO-style group-relative updates, sampling multiple rollouts per prompt to construct normalized learning signals. However, merely increasing the number of rollouts does not reliably strengthen learning: under GRPO-style group normalization, per-rollout policy-gradient features can concentrate into a low-rank, signed geometry, causing substantial cancellation during aggregation and weakening the effective update. We address this failure mode with SALT, a Subspace-Adaptive geometry pLug-in componenT that uses sample-wise gradient geometry to reweight the coefficients of group-relative updates. SALT estimates a dominant shared subspace from the mini-batch Gram geometry, decomposes group-relative coefficients into shared and residual channels, and adaptively amplifies the residual channel when signed cancellation is severe. Across diverse reasoning-oriented RLVR benchmarks and model scales, SALT improves effective update geometry and performance without modifying the reward model or the rollout sampling procedure.
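The mechanism described in the abstract can be illustrated with a small NumPy sketch. This is not the paper's algorithm: the abstract does not give SALT's formulas, so the cancellation diagnostic, the rank-1 subspace choice, the linear gain rule, and the names `cancellation_ratio`, `reweight_coeffs`, `rank`, and `max_gain` are all assumptions made here for illustration. The sketch standardizes rewards within a group, measures how much the signed per-rollout contributions cancel, splits the coefficients into a part inside the dominant eigenspace of the Gram matrix and a residual part, and scales up the residual when cancellation is strong.

```python
import numpy as np


def group_relative_coeffs(rewards, eps=1e-8):
    """GRPO-style coefficients: rewards standardized within one prompt's group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def cancellation_ratio(coeffs, feats):
    """Norm of the aggregated update over the sum of per-rollout norms.

    Values near 1 mean the rollouts reinforce each other; values near 0 mean
    their signed contributions largely cancel. (Illustrative diagnostic only.)
    """
    aggregated = np.linalg.norm(coeffs @ feats)
    total = np.sum(np.abs(coeffs) * np.linalg.norm(feats, axis=1))
    return aggregated / (total + 1e-12)


def reweight_coeffs(coeffs, feats, rank=1, max_gain=2.0):
    """Hypothetical reweighting: amplify the residual channel under cancellation."""
    gram = feats @ feats.T               # (n, n) Gram matrix of rollout features
    _, eigvecs = np.linalg.eigh(gram)    # eigenvalues come back in ascending order
    basis = eigvecs[:, -rank:]           # dominant shared subspace (assumed rank)
    shared = basis @ (basis.T @ coeffs)  # shared channel of the coefficients
    residual = coeffs - shared           # residual channel
    # Assumed gain rule: 1 with no cancellation, max_gain when cancellation is total.
    gain = 1.0 + (max_gain - 1.0) * (1.0 - cancellation_ratio(coeffs, feats))
    return shared + gain * residual


# Toy group: four rollouts whose features share one dominant direction.
rewards = [1.0, 0.0, 1.0, 0.0]
feats = np.array([[1.0, 0.1], [1.0, -0.1], [1.0, 0.2], [1.0, -0.2]])
coeffs = group_relative_coeffs(rewards)
print("cancellation ratio:", cancellation_ratio(coeffs, feats))
print("update norm before:", np.linalg.norm(coeffs @ feats))
print("update norm after: ", np.linalg.norm(reweight_coeffs(coeffs, feats) @ feats))
```

In this toy group the dominant feature direction cancels exactly, because the standardized coefficients sum to zero, and only the small second component survives aggregation. Since `basis` consists of eigenvectors of the Gram matrix, the shared and residual parts of the update are orthogonal, so a gain of at least 1 on the residual cannot shrink the aggregated update norm in this sketch.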

