CL · LG · May 29

Consolidating Rewarded Perturbations for LLM Post-Training

arXiv:2605.31494 · 92.6

Predicted impact: top 22% in CL · last 90 days
Originality: Highly original

AI Analysis

This work addresses how to fold post-training improvements found by weight-space search into a single deployable language model, so that inference costs one forward pass per test example rather than one pass per ensemble member.

This paper introduces CoRP, a gradient-free method for consolidating rewarded perturbations into a single language model. Across five language models (0.5B to 8B) and five tasks, CoRP improves the base model by 8.1 points on average. It also outperforms single-inference RandOpt by 6.5 points while using one tenth of RandOpt's perturbation budget.

Post-training of language models is commonly framed as a sample-score-update loop implemented by gradient descent. A recent line of work, exemplified by RandOpt, relocates this loop to weight space, sampling Gaussian perturbations around a pretrained model and ensembling the top-K rewarded specialists at inference. While competitive with PPO and GRPO under matched training compute, this prediction-level ensemble incurs K forward passes per test example and does not extend cleanly to free-form generation. We ask whether the rewarded population can instead be folded into a single deployable model, replacing the inference-time ensemble with one consolidated update. A split-half analysis over 25 model-task pairs reveals reproducible low-rank structure in every case. We turn this geometry into CoRP (Consolidating Rewarded Perturbations), a gradient-free operator that combines reward-weighted aggregation, compatibility-aware reweighting, and a held-out validation gate, with no gradient flowing through the language model. Across five language models from 0.5B to 8B and five tasks covering math, code, and creative writing, CoRP improves the base model by 8.1 points on average. Using one tenth of RandOpt's perturbation budget, CoRP exceeds single-inference RandOpt by 6.5 points and recovers more than half of the gain of the 50-pass majority-vote ensemble, at one forward pass per test example.
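
To make the sample-score-consolidate idea in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the function name `consolidate`, the softmax weighting over the top-K rewards, the `temperature` parameter, and the flat list of floats standing in for model weights are all assumptions chosen for illustration. The abstract does not specify how compatibility-aware reweighting works, so that step is left out here.

```python
import math
import random


def consolidate(theta, reward_fn, val_fn, n_samples=64, sigma=0.1,
                top_k=8, temperature=1.0, seed=0):
    """Toy sketch: fold top-k rewarded Gaussian perturbations into one update.

    theta      -- base weights as a flat list of floats (stand-in for a model)
    reward_fn  -- scores a weight vector on training data (higher is better)
    val_fn     -- scores a weight vector on held-out data (higher is better)
    Returns (weights, accepted). All names here are hypothetical.
    """
    rng = random.Random(seed)
    dim = len(theta)

    # 1. Sample Gaussian perturbations around the base weights and score each.
    perturbations = [[rng.gauss(0.0, sigma) for _ in range(dim)]
                     for _ in range(n_samples)]
    rewards = [reward_fn([t + e for t, e in zip(theta, eps)])
               for eps in perturbations]

    # 2. Keep the top-k rewarded perturbations.
    top = sorted(range(n_samples), key=lambda i: rewards[i], reverse=True)
    top = top[:top_k]

    # 3. Reward-weighted aggregation (assumed form: softmax over rewards).
    r_max = max(rewards[i] for i in top)
    weights = [math.exp((rewards[i] - r_max) / temperature) for i in top]
    total = sum(weights)
    weights = [w / total for w in weights]
    update = [sum(w * perturbations[i][d] for w, i in zip(weights, top))
              for d in range(dim)]
    candidate = [t + u for t, u in zip(theta, update)]

    # 4. Held-out validation gate: accept only if the candidate beats the base.
    if val_fn(candidate) > val_fn(theta):
        return candidate, True
    return list(theta), False


if __name__ == "__main__":
    target = [1.0, -1.0, 0.5, 2.0]

    def score(w):
        # Toy reward: negative squared distance to a hidden target.
        return -sum((a - b) ** 2 for a, b in zip(w, target))

    new_theta, accepted = consolidate([0.0] * 4, score, score, sigma=0.5)
    print(accepted, score([0.0] * 4), score(new_theta))
```

No gradient is computed anywhere in the sketch: the only signals are scalar rewards, and the gate guarantees the returned weights never score below the base on the held-out function. The result is a single weight vector, so inference needs one forward pass instead of K.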
