CL · LG · Nov 21, 2023

A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift

arXiv:2311.14743v7 · 18 citations · h-index: 11
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of reliably aligning large language models by highlighting how reward models become vulnerable under distribution shift. It is incremental, however, because it builds on existing out-of-distribution (OOD) detection techniques.

The study evaluated the robustness of reward models in Reinforcement Learning with Human Feedback (RLHF) to distribution shifts, finding novel calibration patterns and accuracy drops, with reward models being more sensitive to shifts in responses than prompts.

Foundation models, specifically Large Language Models (LLMs), have recently gained widespread attention and adoption. Reinforcement Learning with Human Feedback (RLHF) involves training a reward model to capture desired behaviors, which is then used to align LLMs. These reward models are also used at inference time to estimate how well LLM responses adhere to those desired behaviors. However, there is little work measuring how robust these reward models are to distribution shifts. In this work, we evaluate how reward model performance, measured via accuracy and calibration (i.e., alignment between accuracy and confidence), is affected by distribution shift. We show novel calibration patterns and accuracy drops due to out-of-distribution (OOD) prompts and responses, and that the reward model is more sensitive to shifts in responses than in prompts. Additionally, we adapt an OOD detection technique commonly used in classification to the reward model setting to detect these distribution shifts in prompts and responses.
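To make "accuracy and calibration" concrete, here is a minimal sketch of how the two quantities could be computed for a pairwise reward model. It is not the authors' evaluation code. It assumes a Bradley-Terry style confidence (the sigmoid of the reward difference) and a simple equal-width binned expected calibration error; the function names `preference_confidence` and `accuracy_and_ece` and the bin count are hypothetical choices made for illustration.

```python
import math


def preference_confidence(r_preferred, r_other):
    """Assumed Bradley-Terry probability that the human-preferred response wins."""
    return 1.0 / (1.0 + math.exp(-(r_preferred - r_other)))


def accuracy_and_ece(pairs, n_bins=5):
    """Compute accuracy and a binned expected calibration error (ECE).

    pairs: list of (reward_for_human_preferred, reward_for_other_response).
    Returns (accuracy, ece).
    """
    records = []
    for r_preferred, r_other in pairs:
        p = preference_confidence(r_preferred, r_other)
        # Confidence in whichever response the model ranks higher, in [0.5, 1].
        confidence = max(p, 1.0 - p)
        # Correct when the model agrees with the human label (ties count as wrong).
        correct = 1.0 if p > 0.5 else 0.0
        records.append((confidence, correct))

    accuracy = sum(correct for _, correct in records) / len(records)

    # Equal-width bins over the confidence range [0.5, 1.0].
    bins = [[] for _ in range(n_bins)]
    for confidence, correct in records:
        idx = min(int((confidence - 0.5) / 0.5 * n_bins), n_bins - 1)
        bins[idx].append((confidence, correct))

    # ECE: weighted average gap between mean confidence and accuracy per bin.
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        mean_acc = sum(k for _, k in bucket) / len(bucket)
        ece += len(bucket) / len(records) * abs(mean_acc - mean_conf)
    return accuracy, ece
```

Under these assumptions, running `accuracy_and_ece` once on in-distribution preference pairs and once on pairs with shifted prompts or responses, then comparing the two results, would be one way to quantify the kind of accuracy and calibration changes the abstract describes.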
