Strategyproof Reinforcement Learning from Human Feedback
This paper addresses a critical issue in RLHF: ensuring truthful feedback from multiple labelers. It has implications for AI alignment and fairness, though the contribution is incremental in that it proposes a new algorithm within an established framework.
The paper tackles the problem of strategic misreporting by labelers in Reinforcement Learning from Human Feedback (RLHF). It shows that existing algorithms are not strategyproof and can lead to arbitrarily large misalignment with social welfare, and it proves a fundamental trade-off between incentive alignment and policy alignment. It proposes the Pessimistic Median of MLEs algorithm, which is approximately strategyproof and converges to the optimal policy under policy coverage assumptions.
We study Reinforcement Learning from Human Feedback (RLHF) in settings where multiple labelers may strategically misreport feedback to steer the learned policy toward their own preferences. We show that existing RLHF algorithms, including recent pluralistic methods, are not strategyproof, and that even a single strategic labeler can cause arbitrarily large misalignment with social welfare. Moreover, we prove that, in the worst case, any strategyproof RLHF algorithm must perform $k$-times worse than the optimal policy, where $k$ is the number of labelers. This suggests a fundamental trade-off between incentive alignment (ensuring labelers report truthfully) and policy alignment (maximizing social welfare). To address this, we propose the Pessimistic Median of MLEs algorithm, which, under appropriate policy coverage assumptions, is approximately strategyproof and converges to the optimal policy as the number of labelers and samples increases. Our results apply to both contextual bandits and Markov decision processes.
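The abstract does not spell out the algorithm, so the following is only a minimal sketch of the intuition behind aggregating per-labeler estimates with a median rather than a mean. It assumes each labeler's reward parameters have already been estimated (the hard-coded vectors are hypothetical stand-ins for MLEs), and it omits both the estimation step and the pessimism component. The function names and numbers are illustrative, not taken from the paper.

```python
from statistics import mean, median


def aggregate_mean(estimates):
    """Coordinate-wise mean of per-labeler parameter estimates."""
    return [mean(coord) for coord in zip(*estimates)]


def aggregate_median(estimates):
    """Coordinate-wise median of per-labeler parameter estimates."""
    return [median(coord) for coord in zip(*estimates)]


def max_shift(before, after):
    """Largest per-coordinate change between two aggregate vectors."""
    return max(abs(a - b) for a, b in zip(before, after))


# Hypothetical per-labeler reward-parameter estimates (stand-ins for MLEs).
truthful = [[0.9, 0.1], [1.0, 0.0], [1.1, -0.1]]

# Assumption: labeler 0 misreports so heavily that its estimate becomes extreme.
manipulated = [[100.0, -100.0]] + truthful[1:]

# How far can one strategic labeler move each aggregate?
mean_shift = max_shift(aggregate_mean(truthful), aggregate_mean(manipulated))
median_shift = max_shift(aggregate_median(truthful), aggregate_median(manipulated))

print(f"shift under mean aggregation:   {mean_shift:.2f}")
print(f"shift under median aggregation: {median_shift:.2f}")
```

In this toy example, the mean can be dragged arbitrarily far by a single extreme report, while the median stays within the range of the other labelers' estimates. This illustrates robustness only; it does not demonstrate the approximate strategyproofness or convergence guarantees stated in the abstract.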