LG | AI | Jan 16, 2025

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

arXiv:2501.09620v2 | 36 citations | h-index: 10
AI Analysis

This paper addresses reward-model biases in LLM alignment. For practitioners relying on RLHF, it offers a practical drop-in enhancement that improves trustworthiness and fairness.

The paper tackles the problem of spurious correlations in reward modeling for large language model alignment, proposing a causal reward modeling approach that mitigates biases like length bias and sycophancy, resulting in more reliable and fair alignment.

Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. Consequently, it often introduces biases (such as length bias, sycophancy, conceptual bias, and discrimination) that hinder the model's ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causality to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM fine-tuning.
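
The counterfactual-invariance idea in the abstract can be illustrated with a small sketch. The abstract does not say how the constraint is enforced, so this is not the paper's implementation. It assumes a standard pairwise (Bradley-Terry) preference loss and adds a simple stand-in penalty: the squared change in reward when only an irrelevant attribute is altered. All names here (`invariance_penalty`, `causal_reward_loss`, `double_length`, the `quality` and `length` fields, the weight `lam`) are hypothetical and chosen for illustration.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def invariance_penalty(reward_fn, responses, perturb_fn):
    """Mean squared reward change when only the irrelevant attribute changes.

    Assumption: perturb_fn returns a counterfactual copy of a response that
    differs only in the spurious variable (here, length).
    """
    diffs = [(reward_fn(x) - reward_fn(perturb_fn(x))) ** 2 for x in responses]
    return sum(diffs) / len(diffs)

def causal_reward_loss(reward_fn, pairs, perturb_fn, lam=1.0):
    """Preference loss plus lam times the invariance penalty (illustrative)."""
    pref = sum(bradley_terry_loss(reward_fn(c), reward_fn(r))
               for c, r in pairs) / len(pairs)
    responses = [x for pair in pairs for x in pair]
    return pref + lam * invariance_penalty(reward_fn, responses, perturb_fn)

# Toy data: each response has a latent quality and a (spurious) length.
pairs = [
    ({"quality": 1.0, "length": 50}, {"quality": 0.2, "length": 200}),
    ({"quality": 0.8, "length": 300}, {"quality": 0.1, "length": 40}),
]

def double_length(x):
    # Counterfactual: same quality, different length.
    return {"quality": x["quality"], "length": 2 * x["length"]}

def length_biased_reward(x):
    # A reward that has picked up a spurious dependence on length.
    return x["quality"] + 0.01 * x["length"]

def quality_only_reward(x):
    # A reward that ignores length entirely.
    return x["quality"]

responses = [x for pair in pairs for x in pair]
biased_penalty = invariance_penalty(length_biased_reward, responses, double_length)
clean_penalty = invariance_penalty(quality_only_reward, responses, double_length)
```

In this toy setting the length-biased reward incurs a positive penalty, while the quality-only reward incurs none, so minimizing the combined loss pushes the reward model away from relying on length. A real reward model would be a neural network trained by gradient descent, and constructing true counterfactual responses is itself nontrivial; the sketch only shows the shape of the objective.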

Code Implementations: 1 repo
