LG, AI, CL · Jan 15, 2025

RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac

Princeton

arXiv:2501.08617v323.319 citations · h-index: 11

Originality: Highly original

AI Analysis

The paper addresses misalignment in generative AI for applications such as consultancy and recommendation systems. It offers a novel method for improving alignment generalization rather than an incremental refinement of existing approaches.

The paper tackles systematic misalignment in Reinforcement Learning from Human Feedback (RLHF) by proposing Reinforcement Learning from Hindsight Simulation (RLHS). RLHS conditions evaluator feedback on simulated outcomes, which decouples the alignment signal from predictions that the AI's output may have compromised. The authors report substantial alignment improvements across multiple settings and benchmarks.

While Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning generative AI, we present empirical evidence that it can also cause severe, systematic misalignment. We hypothesize that this stems from evaluator feedback depending on downstream outcome predictions (foresight) that can be influenced by the AI's output, inducing Goodhart's law dynamics. We present a theoretical analysis showing that conditioning evaluator feedback on downstream observations (hindsight) inhibits this effect by decoupling the alignment signal from potentially compromised predictions--crucially, the result holds even if the observed outcomes are sampled from the AI's own world model. Building on this insight, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback. We validate RLHS across three consultancy settings--marketplace interactions, restaurant recommendations, and online course advising--using both online (PPO) and offline (DPO) fine-tuning methods, and show that it substantially improves alignment over RLHF in experiments and human evaluations. We perform post-hoc benchmark evaluations on TruthfulQA, HaluEval, and TrustLLM, finding that even after single-task fine-tuning, RLHF misalignment persists, whereas RLHS consistently outperforms baselines and demonstrates robust alignment generalization. The project webpage and code are available at https://rl-hindsight.github.io.
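The foresight-versus-hindsight distinction described in the abstract can be illustrated with a toy marketplace sketch. This is not the authors' implementation (the paper fine-tunes language models with PPO and DPO); every name below (`Item`, `Response`, `FEATURE_VALUE`, `customer_buys`, `foresight_feedback`, `hindsight_feedback`, `preferred`) and every number is a hypothetical chosen for illustration. It assumes a customer who buys only when told a desired feature is present, and compares a rating given before the outcome is seen with a rating given after a simulated outcome.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

FEATURE_VALUE = 10.0  # hypothetical value the customer places on the desired feature


@dataclass(frozen=True)
class Item:
    price: float
    has_feature: bool  # ground truth, hidden from the customer before purchase


@dataclass(frozen=True)
class Response:
    label: str
    claims_feature: bool  # what the assistant tells the customer


def customer_buys(response: Response) -> bool:
    # Toy decision rule (an assumption): buy only if told the feature is present.
    return response.claims_feature


def foresight_feedback(item: Item, response: Response) -> float:
    # Rating given right after the conversation. The evaluator predicts the
    # outcome from the assistant's claim, so the claim itself shapes the rating.
    if not customer_buys(response):
        return 0.0
    return FEATURE_VALUE - item.price


def hindsight_feedback(item: Item, response: Response) -> float:
    # Rating given after a simulated outcome. The evaluator sees what the
    # purchase actually delivered, so the claim no longer drives the rating.
    if not customer_buys(response):
        return 0.0
    observed_value = FEATURE_VALUE if item.has_feature else 0.0
    return observed_value - item.price


def preferred(
    feedback: Callable[[Item, Response], float],
    item: Item,
    responses: Sequence[Response],
) -> str:
    # Label of the response that the given feedback signal rates highest.
    return max(responses, key=lambda r: feedback(item, r)).label


item = Item(price=3.0, has_feature=False)
responses = [Response("honest", False), Response("deceptive", True)]

print(preferred(foresight_feedback, item, responses))  # deceptive
print(preferred(hindsight_feedback, item, responses))  # honest
```

In this toy setting the foresight signal rewards the false claim (7.0 versus 0.0), while the hindsight signal penalizes it (-3.0 versus 0.0). The sketch reads the outcome from ground truth; the abstract's stronger claim, that the effect holds even when outcomes are sampled from the AI's own world model, is not modeled here.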

