LG, AI | Dec 18, 2024

Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes

Katarzyna Kobalczyk, Claudio Fanconi, Hao Sun, Mihaela van der Schaar

arXiv:2412.13998v19.23 citations | h-index: 74 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses heterogeneous human preferences in LLM alignment. The advance appears incremental: it builds on existing preference modeling, extending the Bradley-Terry-Luce model with a new practical implementation.

The paper tackles the challenge of aligning large language models with diverse individual user preferences by developing a few-shot steerable alignment framework that infers preferences from small choice samples, enabling LLMs to adapt to individual preferences at inference time and generate outputs across behavioral modes.

As large language models (LLMs) become increasingly embedded in everyday applications, ensuring their alignment with the diverse preferences of individual users has become a critical challenge. Currently deployed approaches typically assume homogeneous user objectives and rely on single-objective fine-tuning. However, human preferences are inherently heterogeneous, influenced by various unobservable factors, leading to conflicting signals in preference data. Existing solutions addressing this diversity often require costly datasets labelled for specific objectives and involve training multiple reward models or LLM policies, which is computationally expensive and impractical. In this work, we present a novel framework for few-shot steerable alignment, where users' underlying preferences are inferred from a small sample of their choices. To achieve this, we extend the Bradley-Terry-Luce model to handle heterogeneous preferences with unobserved variability factors and propose its practical implementation for reward modelling and LLM fine-tuning. Thanks to our proposed approach of functional parameter-space conditioning, LLMs trained with our framework can be adapted to individual preferences at inference time, generating outputs over a continuum of behavioural modes. We empirically validate the effectiveness of our methods, demonstrating their ability to capture and align with diverse human preferences in a data-efficient manner. Our code is made available at: https://github.com/kasia-kobalczyk/few-shot-steerable-alignment.
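As a rough illustration of the idea in the abstract, the sketch below shows the standard Bradley-Terry-Luce (BTL) preference probability and one simple way a reward could be conditioned on a handful of a user's past choices. It is not the paper's method: `infer_user_embedding` (a mean over feature differences) and `conditional_reward` (a linear reward) are hypothetical stand-ins chosen for brevity, and the two-dimensional response features are toy data.

```python
import numpy as np

def btl_prob(r_a, r_b):
    """Standard BTL: P(a preferred over b) = sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

def infer_user_embedding(context):
    """Hypothetical permutation-invariant encoder (an assumption, not the
    paper's architecture): mean of (chosen - rejected) feature differences
    over a user's few-shot context of pairwise choices."""
    diffs = [chosen - rejected for chosen, rejected in context]
    return np.mean(diffs, axis=0)

def conditional_reward(features, z):
    """Hypothetical user-conditioned reward: linear in the response
    features, with the inferred embedding z acting as the weights."""
    return float(features @ z)

# Toy response features: [helpfulness, brevity]
a = np.array([1.0, 0.0])  # detailed but long answer
b = np.array([0.0, 1.0])  # short but less detailed answer

# Few-shot contexts: lists of (chosen, rejected) pairs per user
user_1_context = [(a, b), (a, b)]  # this user picked the detailed answer
user_2_context = [(b, a)]          # this user picked the short answer

z1 = infer_user_embedding(user_1_context)
z2 = infer_user_embedding(user_2_context)

# Same pair of responses, different users, different predicted preference
p1 = btl_prob(conditional_reward(a, z1), conditional_reward(b, z1))
p2 = btl_prob(conditional_reward(a, z2), conditional_reward(b, z2))
print(f"P(a > b | user 1) = {p1:.3f}")
print(f"P(a > b | user 2) = {p2:.3f}")
```

The point of the sketch is only that a single reward function, conditioned on a per-user summary of a few choices, can assign opposite preferences to the same pair of responses, which a single unconditioned BTL reward cannot do.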

