LG AIOct 18, 2023

Improving Generalization of Alignment with Human Preferences through Group Invariant Learning

Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Haoran Huang, Tao Gui, Qi Zhang, Xuanjing Huang

arXiv:2310.11971v319.226 citationsh-index: 40

Originality Incremental advance

AI Analysis

This work addresses the issue of inconsistent performance across domains for AI assistants, which is an incremental improvement over existing RLHF methods.

The paper tackles the problem of poor generalization in Reinforcement Learning from Human Feedback (RLHF) for language models by proposing a method that automatically groups data to maximize performance variance and optimizes policies for challenging groups, resulting in significantly enhanced training stability and model generalization.

The success of AI assistants based on language models (LLMs) hinges crucially on Reinforcement Learning from Human Feedback (RLHF), which enables the generation of responses more aligned with human preferences. As universal AI assistants, there's a growing expectation for them to perform consistently across various domains. However, previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples. This focus on quick reward gains undermines both the stability in training and the model's ability to generalize to new, unseen data. In this work, we propose a novel approach that can learn a consistent policy via RL across various data groups or domains. Given the challenges associated with acquiring group annotations, our method automatically classifies data into different groups, deliberately maximizing performance variance. Then, we optimize the policy to perform well on challenging groups. Lastly, leveraging the established groups, our approach adaptively adjusts the exploration space, allocating more learning capacity to more challenging data and preventing the model from over-optimizing on simpler data. Experimental results indicate that our approach significantly enhances training stability and model generalization.

View on arXiv PDF

Similar