LG AI | Nov 25, 2024

Interpreting Language Reward Models via Contrastive Explanations

Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso

arXiv:2411.16502v2 | 12.59 citations | h-index: 7 | ICLR

Originality: Incremental advance

AI Analysis

This work addresses the need for more transparent reward models, which would improve trust in LLM alignment. It is an incremental advance, since it applies existing explanation techniques to a specific domain.

The authors tackle the problem of interpreting black-box reward models (RMs), which are used to align large language models with human values. They propose using contrastive explanations to explain binary response comparisons, validate the method's effectiveness in quantitative experiments, and demonstrate its qualitative usefulness for analyzing RM sensitivity and behavior.

Reward models (RMs) are a crucial component in the alignment of large language models' (LLMs) outputs with human values. RMs approximate human preferences over possible LLM responses to the same prompt by predicting and comparing reward scores. However, as they are typically modified versions of LLMs with scalar output heads, RMs are large black boxes whose predictions are not explainable. More transparent RMs would enable improved trust in the alignment of LLMs. In this work, we propose to use contrastive explanations to explain any binary response comparison made by an RM. Specifically, we generate a diverse set of new comparisons similar to the original one to characterise the RM's local behaviour. The perturbed responses forming the new comparisons are generated to explicitly modify manually specified high-level evaluation attributes, on which analyses of RM behaviour are grounded. In quantitative experiments, we validate the effectiveness of our method for finding high-quality contrastive explanations. We then showcase the qualitative usefulness of our method for investigating global sensitivity of RMs to each evaluation attribute, and demonstrate how representative examples can be automatically extracted to explain and compare behaviours of different RMs. We see our method as a flexible framework for RM explanation, providing a basis for more interpretable and trustworthy LLM alignment.
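The core loop described in the abstract (perturb a response along a named attribute, re-run the comparison, and see whether the RM's preference changes) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `toy_reward` is a hand-written stand-in rather than a real RM, the perturbed responses are supplied by hand rather than generated, and the function names, attribute names, and flip-rate summary are hypothetical and not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a reward model maps (prompt, response) to a scalar score.
RewardFn = Callable[[str, str], float]


def toy_reward(prompt: str, response: str) -> float:
    """Stand-in reward model (NOT a real RM): rewards longer responses
    and penalises each occurrence of the hedging word 'maybe'."""
    words = response.lower().split()
    return len(words) - 5.0 * words.count("maybe")


def prefers_first(rm: RewardFn, prompt: str, a: str, b: str) -> bool:
    """Binary comparison: does the RM score response a above response b?"""
    return rm(prompt, a) > rm(prompt, b)


def explain_comparison(
    rm: RewardFn,
    prompt: str,
    chosen: str,
    rejected: str,
    perturbations: Dict[str, List[str]],
) -> List[Tuple[str, str, bool]]:
    """For each attribute, swap in each perturbed version of `chosen` and
    record whether the RM's original preference flips."""
    original = prefers_first(rm, prompt, chosen, rejected)
    results = []
    for attribute, candidates in perturbations.items():
        for candidate in candidates:
            flipped = prefers_first(rm, prompt, candidate, rejected) != original
            results.append((attribute, candidate, flipped))
    return results


def flip_rate(results: List[Tuple[str, str, bool]]) -> Dict[str, float]:
    """Fraction of perturbations per attribute that flipped the preference,
    used here as a crude sensitivity summary (an illustrative choice)."""
    counts: Dict[str, List[int]] = {}
    for attribute, _, flipped in results:
        flips, total = counts.setdefault(attribute, [0, 0])
        counts[attribute] = [flips + int(flipped), total + 1]
    return {attr: flips / total for attr, (flips, total) in counts.items()}


prompt = "Is the sky blue?"
chosen = "Yes, the sky is blue because of Rayleigh scattering."
rejected = "Yes, it is blue."
# Hand-written perturbations of `chosen`, keyed by the attribute each one modifies.
perturbations = {
    "conciseness": ["Yes."],
    "confidence": ["Maybe yes, the sky is blue because of Rayleigh scattering."],
}

results = explain_comparison(toy_reward, prompt, chosen, rejected, perturbations)
rates = flip_rate(results)
for attribute, candidate, flipped in results:
    print(f"{attribute}: flipped={flipped} | {candidate}")
```

With this toy scorer, shortening the response flips the preference while adding a hedge does not, which is the kind of per-attribute contrast the abstract refers to. A real study would replace `toy_reward` with an actual RM and would generate the perturbations rather than writing them by hand.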

