AI | LG | Oct 22, 2024

ICPL: Few-shot In-context Preference Learning via LLMs

Chao Yu, Qixin Tan, Hong Lu, Jiaxuan Gao, Xinting Yang, Yu Wang, Yi Wu, Eugene Vinitsky

arXiv:2410.17233v3 | 12.57 citations | h-index: 20

Originality: Highly original

AI Analysis

This paper addresses inefficient human querying in preference learning for reinforcement learning tasks. Its approach is novel, if incremental, in applying LLMs to an existing bottleneck.

The paper tackles the inefficiency of preference-based reinforcement learning by proposing In-Context Preference Learning (ICPL), which uses large language models to learn preferences sample-efficiently. In a synthetic preference study, ICPL reaches much higher performance than baseline preference-based methods with orders of magnitude greater efficiency, and trials with real human feedback indicate that it also works with humans in the loop.

Preference-based reinforcement learning is an effective way to handle tasks where rewards are hard to specify but can be exceedingly inefficient as preference learning is often tabula rasa. We demonstrate that Large Language Models (LLMs) have native preference-learning capabilities that allow them to achieve sample-efficient preference learning, addressing this challenge. We propose In-Context Preference Learning (ICPL), which uses in-context learning capabilities of LLMs to reduce human query inefficiency. ICPL uses the task description and basic environment code to create sets of reward functions which are iteratively refined by placing human feedback over videos of the resultant policies into the context of an LLM and then requesting better rewards. We first demonstrate ICPL's effectiveness through a synthetic preference study, providing quantitative evidence that it significantly outperforms baseline preference-based methods with much higher performance and orders of magnitude greater efficiency. We observe that these improvements are not solely coming from LLM grounding in the task but that the quality of the rewards improves over time, indicating preference learning capabilities. Additionally, we perform a series of real human preference-learning trials and observe that ICPL extends beyond synthetic settings and can work effectively with humans-in-the-loop.
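
The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: every name here (`icpl_loop`, `make_toy_propose`, `toy_rollout`, `toy_prefer`, `HIDDEN_TARGET`) is hypothetical, a "reward function" is reduced to a single float, the LLM is replaced by a random proposer, and the human is replaced by a rule that picks the best and worst candidate. Collecting one best and one worst choice per round is an assumption of this sketch.

```python
import random
from typing import Callable, List, Tuple

HIDDEN_TARGET = 3.0  # hypothetical "true preference" known only to the evaluator


def icpl_loop(
    propose: Callable[[List[Tuple[float, float]], int], List[float]],
    rollout: Callable[[float], float],
    prefer: Callable[[List[float]], Tuple[int, int]],
    iterations: int,
    k: int,
) -> float:
    """Run `iterations` rounds of propose -> rollout -> prefer; return the last preferred candidate."""
    history: List[Tuple[float, float]] = []  # (preferred, rejected) pairs: the "context"
    best = float("nan")
    for _ in range(iterations):
        candidates = propose(history, k)               # stand-in for the LLM writing k rewards
        behaviours = [rollout(c) for c in candidates]  # stand-in for training a policy and rendering video
        best_i, worst_i = prefer(behaviours)           # stand-in for human feedback
        history.append((candidates[best_i], candidates[worst_i]))
        best = candidates[best_i]
    return best


def make_toy_propose(seed: int) -> Callable[[List[Tuple[float, float]], int], List[float]]:
    """Build a seeded toy proposer that perturbs the previously preferred candidate."""
    rng = random.Random(seed)

    def propose(history: List[Tuple[float, float]], k: int) -> List[float]:
        if not history:
            return [rng.uniform(-10.0, 10.0) for _ in range(k)]
        preferred, _rejected = history[-1]
        # Keep the previous winner first so quality never regresses in this toy.
        return [preferred] + [preferred + rng.gauss(0.0, 1.0) for _ in range(k - 1)]

    return propose


def toy_rollout(candidate: float) -> float:
    """Toy 'policy behaviour': the candidate value itself."""
    return candidate


def toy_prefer(behaviours: List[float]) -> Tuple[int, int]:
    """Toy evaluator: indices of the behaviours closest to and farthest from the hidden target."""
    errors = [abs(b - HIDDEN_TARGET) for b in behaviours]
    return errors.index(min(errors)), errors.index(max(errors))


if __name__ == "__main__":
    result = icpl_loop(make_toy_propose(0), toy_rollout, toy_prefer, iterations=20, k=6)
    print(f"preferred candidate after 20 rounds: {result:.3f}")
```

In the setting the abstract describes, `propose` would correspond to prompting an LLM with the task description, environment code, and accumulated feedback, and `rollout` to training a policy under each candidate reward and recording a video. The toy versions only show how feedback is threaded through successive rounds.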

