LG AI GT Jan 16, 2025

Clone-Robust AI Alignment

Ariel D. Procaccia, Benjamin Schiffer, Shirley Zhang

arXiv:2501.09254v115.76 citations h-index: 62 ICML

Originality: Incremental advance

AI Analysis

This work addresses a specific AI alignment challenge facing LLM developers. It is an incremental advance: it extends existing RLHF methods with a new robustness property.

The paper studies aligning Large Language Models with human preferences via Reinforcement Learning with Human Feedback (RLHF) when the input dataset is unbalanced. It introduces robustness to approximate clones, a property requiring that adding near-duplicate alternatives does not significantly change the learned reward function. The standard regularized MLE algorithm fails to satisfy this property; the proposed weighted MLE guarantees it while preserving desirable theoretical properties.

A key challenge in training Large Language Models (LLMs) is properly aligning them with human preferences. Reinforcement Learning with Human Feedback (RLHF) uses pairwise comparisons from human annotators to train reward functions and has emerged as a popular alignment method. However, input datasets in RLHF are not necessarily balanced in the types of questions and answers that are included. Therefore, we want RLHF algorithms to perform well even when the set of alternatives is not uniformly distributed. Drawing on insights from social choice theory, we introduce robustness to approximate clones, a desirable property of RLHF algorithms which requires that adding near-duplicate alternatives does not significantly change the learned reward function. We first demonstrate that the standard RLHF algorithm based on regularized maximum likelihood estimation (MLE) fails to satisfy this property. We then propose the weighted MLE, a new RLHF algorithm that modifies the standard regularized MLE by weighting alternatives based on their similarity to other alternatives. This new algorithm guarantees robustness to approximate clones while preserving desirable theoretical properties.
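The contrast between the standard and weighted MLE can be illustrated with a small sketch. The abstract does not spell out the paper's weighting scheme, so everything below is an assumption made for illustration: a linear reward r(x) = theta . x, a Bradley-Terry comparison model, and a stand-in weight equal to one over the number of alternatives within a distance `eps`. The names `similarity_weights` and `fit_reward` are hypothetical, not the authors' code.

```python
import math


def similarity_weights(alternatives, eps=0.05):
    """Stand-in weighting (an assumption, not the paper's scheme):
    each alternative gets 1 / (number of alternatives within eps of it,
    counting itself), so a cluster of near-duplicates shares one unit of weight."""
    weights = []
    for a in alternatives:
        near = sum(1 for b in alternatives if math.dist(a, b) <= eps)
        weights.append(1.0 / near)
    return weights


def fit_reward(alternatives, comparisons, weights=None, lam=0.1, lr=0.1, steps=2000):
    """Gradient ascent on an L2-regularized Bradley-Terry log-likelihood.

    alternatives: list of feature tuples.
    comparisons:  list of (winner_index, loser_index) pairs.
    weights:      per-alternative weights; None means the unweighted MLE.
    Assumed reward model: r(x) = theta . x (linear, for illustration only).
    """
    dim = len(alternatives[0])
    if weights is None:
        weights = [1.0] * len(alternatives)
    theta = [0.0] * dim
    for _ in range(steps):
        # Gradient of the regularizer -lam * ||theta||^2.
        grad = [-2.0 * lam * t for t in theta]
        for win, lose in comparisons:
            diff = [x - y for x, y in zip(alternatives[win], alternatives[lose])]
            margin = sum(t * d for t, d in zip(theta, diff))
            p_win = 1.0 / (1.0 + math.exp(-margin))
            # Each comparison is scaled by the product of its two weights.
            w = weights[win] * weights[lose]
            for k in range(dim):
                grad[k] += w * (1.0 - p_win) * diff[k]
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


# Toy data: alternative 2 is an exact clone of alternative 1.
base_alts = [(0.0,), (1.0,)]
base_cmps = [(1, 0)]
cloned_alts = [(0.0,), (1.0,), (1.0,)]
cloned_cmps = [(1, 0), (2, 0)]

theta_base = fit_reward(base_alts, base_cmps)
theta_unweighted = fit_reward(cloned_alts, cloned_cmps)
theta_weighted = fit_reward(cloned_alts, cloned_cmps, similarity_weights(cloned_alts))
```

In this toy setup the clone doubles the evidence seen by the unweighted estimator, which pushes `theta_unweighted` away from `theta_base`. With the stand-in weights the two cloned comparisons each count half, so `theta_weighted` matches the clone-free fit. This illustrates the property only; it is not a proof and not the paper's construction.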

