LG · CL · May 7

RVPO: Risk-Sensitive Alignment via Variance Regularization

Ivan Montero, Tomasz Jurczyk, Bhuwan Dhingra

arXiv:2605.05750 · 31.8 · h-index: 2

Predicted impact: top 8% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For practitioners of multi-objective alignment, RVPO provides a simple fix to constraint neglect without sacrificing general capabilities.

RVPO adds a variance penalty to multi-reward RLHF so that models cannot neglect difficult constraints. On HealthBench it scores 0.261 vs. 0.215 for GDPO (14B, p<0.001), and on GPQA-Diamond it avoids the late-stage degradation seen in other multi-reward methods.

Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing "bottleneck" rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from "maximize sum" to "maximize consistency." We show via Taylor expansion that a LogSumExp (SoftMin) operator effectively acts as a smooth variance penalty. We evaluate RVPO on rubric-based medical and scientific reasoning with up to 17 concurrent LLM-judged reward signals (Qwen2.5-3B/7B/14B) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B). By preventing the model from neglecting difficult constraints to exploit easier objectives, RVPO improves overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, $p < 0.001$) and maintains competitive accuracy on GPQA-Diamond without the late-stage degradation observed in other multi-reward methods, demonstrating that variance regularization mitigates constraint neglect across model scales without sacrificing general capabilities.
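To make the variance-penalty claim concrete, the following is a minimal sketch, not the authors' implementation. It assumes a SoftMin of the form $-\frac{1}{\beta}\log\left(\frac{1}{K}\sum_k e^{-\beta r_k}\right)$ applied directly to $K$ per-reward scores; the temperature $\beta$, the function names, and the example reward vectors are all hypothetical, and the paper may apply the operator to advantages rather than raw scores. Under that assumption, a second-order expansion in $\beta$ gives $\mu - \frac{\beta}{2}\sigma^2$, where $\mu$ and $\sigma^2$ are the mean and variance of the rewards.

```python
import numpy as np


def mean_aggregate(rewards):
    """Arithmetic-mean aggregation: a high reward can offset a failing one."""
    return float(np.mean(rewards))


def softmin_aggregate(rewards, beta=1.0):
    """Assumed SoftMin form: -(1/beta) * log(mean(exp(-beta * r))).

    Computed with the usual max-shift so the exponentials cannot overflow.
    """
    r = np.asarray(rewards, dtype=float)
    z = -beta * r
    shift = z.max()
    log_mean_exp = shift + np.log(np.mean(np.exp(z - shift)))
    return float(-log_mean_exp / beta)


def variance_penalized(rewards, beta=1.0):
    """Second-order Taylor approximation of the SoftMin: mean - (beta/2) * variance."""
    r = np.asarray(rewards, dtype=float)
    return float(r.mean() - 0.5 * beta * r.var())


# Two hypothetical reward vectors with the same mean (0.6).
balanced = [0.6, 0.6, 0.6]    # every objective satisfied moderately
lopsided = [1.0, 1.0, -0.2]   # two easy objectives maxed, one constraint neglected

for name, r in [("balanced", balanced), ("lopsided", lopsided)]:
    print(name, mean_aggregate(r), softmin_aggregate(r), variance_penalized(r))
```

The arithmetic mean cannot tell the two vectors apart, while the SoftMin leaves the balanced vector at its mean and scores the lopsided one lower. For small $\beta$ the SoftMin and the explicit mean-minus-variance form agree closely; as $\beta$ grows, the SoftMin approaches the minimum reward.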
