LG · CL · May 7

RVPO: Risk-Sensitive Alignment via Variance Regularization

arXiv:2605.05750 · 31.8 · h-index: 2
Predicted impact top 8% in LG · last 90 days · Originality: Incremental advance
AI Analysis

For practitioners of multi-objective alignment, RVPO provides a simple fix to constraint neglect without sacrificing general capabilities.

RVPO adds a variance penalty to multi-reward RLHF so that models cannot neglect difficult constraints. On HealthBench it scores 0.261 vs. 0.215 for GDPO (14B, p<0.001), and it avoids the late-stage degradation on GPQA-Diamond seen in other multi-reward methods.

Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing "bottleneck" rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from "maximize sum" to "maximize consistency." We show via Taylor expansion that a LogSumExp (SoftMin) operator effectively acts as a smooth variance penalty. We evaluate RVPO on rubric-based medical and scientific reasoning with up to 17 concurrent LLM-judged reward signals (Qwen2.5-3B/7B/14B) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B). By preventing the model from neglecting difficult constraints to exploit easier objectives, RVPO improves overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, $p < 0.001$) and maintains competitive accuracy on GPQA-Diamond without the late-stage degradation observed in other multi-reward methods, demonstrating that variance regularization mitigates constraint neglect across model scales without sacrificing general capabilities.
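The link between a soft-min and a variance penalty can be checked numerically. The sketch below assumes one common mean-normalized form of the operator, $-\frac{1}{\beta}\log\big(\frac{1}{K}\sum_k e^{-\beta r_k}\big)$, whose second-order Taylor expansion in $\beta$ is $\mu - \frac{\beta}{2}\sigma^2$ (mean minus a scaled population variance). The function names, the temperature `beta`, and the example reward vectors are illustrative assumptions; the paper's exact operator, normalization, and advantage aggregation are not given in this abstract and may differ.

```python
import math


def soft_min(rewards, beta):
    """Mean-normalized soft-min: -(1/beta) * log(mean(exp(-beta * r))).

    Assumed form for illustration only; not necessarily the paper's operator.
    """
    k = len(rewards)
    # Shift by the minimum for numerical stability (log-sum-exp trick).
    m = min(rewards)
    s = sum(math.exp(-beta * (r - m)) for r in rewards)
    return m - math.log(s / k) / beta


def mean_minus_variance(rewards, beta):
    """Second-order Taylor approximation of soft_min around beta = 0."""
    k = len(rewards)
    mu = sum(rewards) / k
    var = sum((r - mu) ** 2 for r in rewards) / k  # population variance
    return mu - 0.5 * beta * var


# Two hypothetical reward vectors with the same arithmetic mean (0.6):
balanced = [0.6, 0.6, 0.6]   # every objective moderately satisfied
lopsided = [1.0, 0.8, 0.0]   # one objective neglected entirely

for name, r in [("balanced", balanced), ("lopsided", lopsided)]:
    mean = sum(r) / len(r)
    print(name, "mean:", mean,
          "soft_min:", soft_min(r, beta=0.1),
          "mean - (beta/2)*var:", mean_minus_variance(r, beta=0.1))
```

Under these assumptions, the arithmetic mean cannot tell the two vectors apart, while the soft-min ranks the lopsided one lower, and for small `beta` it stays close to the mean-minus-variance approximation. Larger `beta` values push the operator toward the hard minimum, where the second-order approximation no longer holds.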

