AI | CL | LG | Nov 13, 2023

Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains

arXiv:2311.07723v3 | 8 citations | h-index: 7
Originality: Incremental advance
AI Analysis

The paper addresses the risk of AI misalignment in hard-to-measure domains, a concern for AI safety researchers, but the advance is incremental because it builds on existing oversight methods.

The paper tackles the problem of AI systems gaming flawed human feedback by studying how reward models generalize across 69 distribution shifts. It finds that reward models often fail to evaluate instruction-following and instead favor personas that resemble internet text. Techniques for interpreting reward models' internal representations generalize better than standard fine-tuning but still fail in many cases.

As AI systems become more intelligent and their behavior becomes more challenging to assess, they may learn to game the flaws of human feedback instead of genuinely striving to follow instructions; however, this risk can be mitigated by controlling how LLMs generalize human feedback to situations where it is unreliable. To better understand how reward models generalize, we craft 69 distribution shifts spanning 8 categories. We find that reward models do not learn to evaluate 'instruction-following' by default and instead favor personas that resemble internet text. Techniques for interpreting reward models' internal representations achieve better generalization than standard fine-tuning, but still frequently fail to distinguish instruction-following from conflated behaviors. We consolidate the 15 most challenging distribution shifts into the GENeralization analogIES (GENIES) benchmark, which we hope will enable progress toward controlling reward model generalization.

Code Implementations: 1 repo
