Capturing Individual Human Preferences with Reward Features
This work addresses the challenge of personalizing AI systems in settings where people disagree strongly, such as large language model training, though it is an incremental improvement over existing adaptive methods.
The paper tackles the problem of modeling individual human preferences in reinforcement learning from human feedback. It proposes a method to specialize reward models to specific people or groups and shows that, in experiments with large language models, the method either significantly outperforms the baselines or matches their performance.
Reinforcement learning from human feedback usually models preferences using a reward model that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, such as the training of large language models. We propose a method to specialise a reward model to a person or group of people. Our approach builds on the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models comparing the proposed architecture with a non-adaptive reward model and also adaptive counterparts, including models that do in-context personalisation. Depending on how much disagreement there is in the training data, our model either significantly outperforms the baselines or matches their performance with a simpler architecture and more stable training.
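The adaptation step described in the abstract can be illustrated with a minimal sketch. It assumes the reward features are already learned and frozen, so random vectors stand in for them here, and it assumes a Bradley-Terry style preference likelihood, which the abstract does not specify. The names `fit_user_weights` and `reward` are hypothetical and are not taken from the paper. Under those assumptions, adapting to a new person reduces to fitting one weight vector by logistic regression on feature differences.

```python
import numpy as np

def fit_user_weights(phi_chosen, phi_rejected, lr=0.5, steps=500, l2=1e-3):
    """Fit a per-user weight vector w over frozen reward features.

    Assumes P(chosen preferred) = sigmoid(w . (phi_chosen - phi_rejected)),
    i.e. a Bradley-Terry model with a linear reward (an assumption here).
    """
    diff = phi_chosen - phi_rejected            # shape (n_pairs, d)
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))   # predicted P(chosen preferred)
        # Gradient of the mean negative log-likelihood plus L2 penalty.
        grad = -(diff * (1.0 - p)[:, None]).mean(axis=0) + l2 * w
        w -= lr * grad
    return w

def reward(phi, w):
    """User-specific reward: a linear combination of shared features."""
    return phi @ w

# Synthetic stand-in data: random "features" and a hidden user preference.
rng = np.random.default_rng(0)
d = 4
true_w = np.array([1.0, -2.0, 0.5, 0.0])        # hypothetical user weights
phi_a = rng.normal(size=(200, d))
phi_b = rng.normal(size=(200, d))

# The simulated user prefers whichever response has the higher true reward.
prefer_a = (phi_a @ true_w) > (phi_b @ true_w)
phi_chosen = np.where(prefer_a[:, None], phi_a, phi_b)
phi_rejected = np.where(prefer_a[:, None], phi_b, phi_a)

w_hat = fit_user_weights(phi_chosen, phi_rejected)
accuracy = np.mean((phi_chosen - phi_rejected) @ w_hat > 0)
print("fraction of pairs ranked consistently:", accuracy)
```

Because only the weight vector is fitted while the features stay fixed, the per-user problem is convex and low-dimensional, which is one plausible reading of why adaptation can be fast and stable.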