PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates
This work addresses a specific bottleneck in reinforcement learning, namely the stability and efficiency of policy training, though it appears incremental because it builds on existing proximal methods.
The paper tackles the problem of controlling KL divergence in policy optimization by introducing PROMA, a reference-free proximal method that projects away high-variance gradient components. The accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping, and the intra-microbatch variant achieves the best validation performance.
This note introduces Projected Microbatch Accumulation (PROMA), a reference-free proximal policy method that controls KL divergence by projecting away high-variance components of the policy gradient. Two variants are presented. In the accumulation-based variant, the running gradient is projected orthogonal to the sequence-wise log-probability gradients of each microbatch. In the intra-microbatch variant, a factored projection using dominant subspaces of activations and gradient outputs is applied independently within each microbatch, making it compatible with standard data-parallel training. Empirically, the accumulation variant achieves tighter per-step KL control than GRPO with PPO clipping, while the intra-microbatch variant achieves the best validation performance.
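The accumulation-based variant described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `project_out` and `accumulate_projected`, the flattened-vector representation of gradients, and in particular the ordering (project the running gradient against the current microbatch's sequence-wise log-probability gradients, then add that microbatch's own policy gradient) are assumptions made for illustration. The abstract states only that the running gradient is projected orthogonal to those sequence-wise gradients.

```python
import numpy as np


def project_out(g, seq_grads, rel_tol=1e-10):
    """Return g minus its component in the span of the rows of seq_grads."""
    g = np.asarray(g, dtype=float)
    seq_grads = np.atleast_2d(np.asarray(seq_grads, dtype=float))
    # Orthonormal basis of the row space from the SVD; this stays
    # well-behaved when sequence gradients are linearly dependent.
    _, s, vt = np.linalg.svd(seq_grads, full_matrices=False)
    if s.size == 0 or s.max() == 0.0:
        return g.copy()
    basis = vt[s > rel_tol * s.max()]  # shape (rank, d)
    return g - basis.T @ (basis @ g)


def accumulate_projected(microbatches):
    """Accumulate gradients over (policy_grad, seq_logprob_grads) pairs.

    policy_grad:       flattened policy gradient of the microbatch, shape (d,)
    seq_logprob_grads: one flattened log-probability gradient per sequence,
                       shape (num_sequences, d)
    """
    running = None
    for policy_grad, seq_grads in microbatches:
        policy_grad = np.asarray(policy_grad, dtype=float)
        if running is None:
            running = policy_grad.copy()
        else:
            # ASSUMPTION: project the running gradient first, then add the
            # new microbatch gradient. The source does not specify the order.
            running = project_out(running, seq_grads) + policy_grad
    return running


# Toy data standing in for real gradients (hypothetical sizes).
rng = np.random.default_rng(0)
d = 16
mb1 = (rng.normal(size=d), rng.normal(size=(4, d)))
mb2 = (rng.normal(size=d), rng.normal(size=(4, d)))

total = accumulate_projected([mb1, mb2])
# The part carried over from the first microbatch should now be orthogonal
# to every sequence gradient of the second microbatch.
carried = total - mb2[0]
print(np.abs(mb2[1] @ carried).max())
```

Under these assumptions, whatever earlier microbatches contributed cannot, to first order, move the log-probabilities of the sequences in the current microbatch, which is one way to read the KL-control claim. Materialising per-sequence gradients as dense vectors is only practical at toy scale, and the sketch does not cover the intra-microbatch variant with its factored projection.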