LG, AI, ML | Jul 21, 2023

Kernelized Offline Contextual Dueling Bandits

CMU
arXiv:2307.11288v1, 5 citations, h-index: 58
Originality: Incremental advance
AI Analysis

This work addresses the high cost of human feedback in applications such as reinforcement learning for large language models, and offers a more efficient method for policy identification. The advance is incremental: it builds on existing dueling bandit and contextual bandit concepts.

The paper tackles the problem of learning efficiently from costly human feedback in preference-based settings. It introduces the offline contextual dueling bandit setting, in which the agent chooses the contexts at which feedback is collected. It gives an upper-confidence-bound style algorithm with a proven regret bound, and shows empirically that this algorithm outperforms a similar strategy that samples contexts uniformly.

Preference-based feedback is important for many applications where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback on large language models. For many of these applications, the cost of acquiring the human feedback can be substantial or even prohibitive. In this work, we take advantage of the fact that often the agent can choose contexts at which to obtain human feedback in order to most efficiently identify a good policy, and introduce the offline contextual dueling bandit setting. We give an upper-confidence-bound style algorithm for this setting and prove a regret bound. We also give empirical confirmation that this method outperforms a similar strategy that uses uniformly sampled contexts.
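
To make the idea of choosing contexts concrete, here is a minimal sketch of confidence-width-based duel selection with a kernel ridge preference model. It is not the paper's algorithm: the RBF kernel, the regularizer `lam`, the toy reward `-(a - x)**2`, the Bradley-Terry feedback model, and all function names are assumptions made for illustration. Each round, the sketch queries the (context, action, action) triple whose predicted preference is most uncertain, rather than sampling a context uniformly.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5):
    # Pairwise RBF kernel between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def posterior(Z_obs, y_obs, Z_query, lam=0.1):
    # Kernel ridge mean and GP-style standard deviation at the query points.
    if len(Z_obs) == 0:
        return np.zeros(len(Z_query)), np.ones(len(Z_query))
    K = rbf_kernel(Z_obs, Z_obs) + lam * np.eye(len(Z_obs))
    k_star = rbf_kernel(Z_query, Z_obs)
    mean = k_star @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def select_duel(contexts, actions, Z_obs, y_obs):
    # Enumerate every (context, action, action) triple and return the one
    # with the widest confidence interval under the current model.
    cands = [(x, a, b)
             for x in contexts
             for i, a in enumerate(actions)
             for b in actions[i + 1:]]
    Z = np.array(cands, dtype=float)
    _, std = posterior(Z_obs, y_obs, Z)
    return Z[int(np.argmax(std))]

def simulate(n_rounds=20, seed=0):
    rng = np.random.default_rng(seed)
    contexts = np.linspace(0.0, 1.0, 5)
    actions = np.linspace(0.0, 1.0, 4)
    reward = lambda x, a: -(a - x) ** 2      # hypothetical latent reward
    Z_obs, y_obs = np.empty((0, 3)), np.empty(0)
    for _ in range(n_rounds):
        z = select_duel(contexts, actions, Z_obs, y_obs)
        x, a, b = z
        # Assumed Bradley-Terry feedback: a beats b with this probability.
        p_win = 1.0 / (1.0 + np.exp(-(reward(x, a) - reward(x, b))))
        y = float(rng.random() < p_win) - 0.5   # centred win indicator
        Z_obs = np.vstack([Z_obs, z])
        y_obs = np.append(y_obs, y)
    return Z_obs, y_obs
```

Calling `simulate()` returns the queried triples and the centred win indicators. A uniform baseline would replace `select_duel` with a random draw from the same candidate set, which is the kind of comparison the abstract describes.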

