CL | AI | Oct 15, 2024

RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals

arXiv:2410.11348v3 | 3 citations | h-index: 9 | ICML
Originality: Incremental advance
AI Analysis

This work addresses the opacity of black-box reward models in LLM alignment, a concern for both researchers and practitioners. The advance is incremental, since it builds on existing causal estimation techniques.

The paper tackles the problem of understanding what reward models reward by developing RATE, a method to measure the causal effect of high-level attributes like sentiment on reward models, and shows it is an effective estimator.

Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what, exactly, they are actually rewarding. In this paper we develop the Rewrite-based Attribute Treatment Estimator (RATE) as an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward. RATE uses LLMs to rewrite responses, producing imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.
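The rewrite-twice idea can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration, not the paper's code: `toy_reward` stands in for a reward model, `toy_rewrite` stands in for an LLM rewriter that changes the target attribute (the word "great") but also fixes a typo as an unintended side effect, and the double-rewrite comparison (rewrite-of-rewrite against single rewrite) is one plausible reading of "rewriting twice". The paper's exact estimator may differ in its details.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical type aliases for this sketch.
Rewriter = Callable[[str, int], str]  # (text, target attribute value) -> rewritten text
Reward = Callable[[str], float]


def toy_reward(text: str) -> float:
    """Stand-in reward model: +1.0 for positive sentiment, -0.5 for a typo."""
    words = text.split()
    return 1.0 * ("great" in words) - 0.5 * ("teh" in words)


def toy_rewrite(text: str, target: int) -> str:
    """Stand-in rewriter: sets the sentiment attribute, but also fixes typos
    as a side effect (this side effect is the 'imperfect' part)."""
    words = [w for w in text.replace("teh", "the").split() if w != "great"]
    if target == 1:
        words.append("great")
    return " ".join(words)


def naive_effect(responses: Sequence[str], rewrite: Rewriter, reward: Reward) -> float:
    """Single rewrite: original (attribute on) versus its rewrite (attribute off)."""
    return mean(reward(x) - reward(rewrite(x, 0)) for x in responses)


def double_rewrite_effect(responses: Sequence[str], rewrite: Rewriter, reward: Reward) -> float:
    """Rewrite twice so both sides of the comparison carry the rewriter's side effects."""
    diffs = []
    for x in responses:
        off = rewrite(x, 0)         # attribute removed, side effects applied
        on_again = rewrite(off, 1)  # attribute restored, side effects still applied
        diffs.append(reward(on_again) - reward(off))
    return mean(diffs)


RESPONSES = ["teh food was great", "teh service was great"]

print(naive_effect(RESPONSES, toy_rewrite, toy_reward))           # 0.5
print(double_rewrite_effect(RESPONSES, toy_rewrite, toy_reward))  # 1.0
```

In this toy setup the reward model's true sensitivity to sentiment is 1.0. The single-rewrite comparison reports 0.5 because the rewriter's typo fix leaks into the difference, while the double-rewrite comparison recovers 1.0 because the side effect appears on both sides and cancels.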

Code Implementations: 1 repo