DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding
This work matters to developers of large language models because it offers a way to handle diverse human preferences and reduce the risk of proxy over-optimization without retraining the model.
This paper addresses the issue of heterogeneous human preferences in preference-based alignment methods by proposing Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC). DARC is a retraining-free, inference-time method that reranks candidates using a KL-robust satisfaction objective, reducing disagreement and tail risk while maintaining competitive average quality.
Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a *KL-robust (entropic)* satisfaction objective. It also provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide a theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.
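To make the reranking rule concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each candidate response comes with a handful of satisfaction scores (one per annotator or preference sample) and uses the standard entropic form of a KL-robust objective, $-\frac{1}{\beta}\log \mathbb{E}[e^{-\beta s}]$, with the risk premium taken as the gap between the mean and this value. The function names (`entropic_value`, `rerank`) and the parameters `beta`, `budget`, and `penalty` are hypothetical names chosen for illustration; the abstract does not specify the exact formulas or interface.

```python
import numpy as np

def entropic_value(scores, beta):
    """Entropic (KL-robust) value per candidate: -(1/beta) * log(mean(exp(-beta * s))).

    `scores` has shape (num_candidates, num_samples). This functional form is
    an assumption based on the standard entropic risk measure.
    """
    s = np.asarray(scores, dtype=float)
    z = -beta * s
    m = z.max(axis=-1, keepdims=True)
    # Numerically stable log-mean-exp over the preference samples.
    log_mean_exp = m.squeeze(-1) + np.log(np.mean(np.exp(z - m), axis=-1))
    return -log_mean_exp / beta

def rerank(scores, beta=1.0, budget=None, penalty=None):
    """Pick a candidate index under one of three illustrative rules.

    - default: maximize the entropic value
    - budget:  maximize the mean among candidates whose premium <= budget
    - penalty: maximize mean - penalty * premium
    """
    s = np.asarray(scores, dtype=float)
    mean = s.mean(axis=1)
    robust = entropic_value(s, beta)
    premium = mean - robust  # non-negative by Jensen's inequality
    if budget is not None:
        feasible = premium <= budget
        if feasible.any():
            objective = np.where(feasible, mean, -np.inf)
        else:
            # Assumed fallback: no candidate fits the budget, take the least risky.
            objective = -premium
    elif penalty is not None:
        objective = mean - penalty * premium
    else:
        objective = robust
    return int(np.argmax(objective)), mean, robust, premium

# Candidate 0 has the higher mean, but one annotator strongly dislikes it.
# Candidate 1 is slightly worse on average and uncontroversial.
scores = [[1.0, 1.0, 1.0, -0.6],
          [0.5, 0.5, 0.5, 0.5]]

best_mean = int(np.argmax(np.mean(scores, axis=1)))
best_robust, mean, robust, premium = rerank(scores, beta=2.0)
best_capped, *_ = rerank(scores, beta=2.0, budget=0.1)
print(best_mean, best_robust, best_capped)
```

In this toy input, mean-reward selection prefers the contested candidate, while the entropic rule and the capped rule both prefer the uncontroversial one. Larger values of `beta` weight the unhappy tail more heavily; as `beta` approaches zero the entropic value approaches the plain mean.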