LG AI IT ST MLJun 18, 2024

On the Exponential Convergence for Offline RLHF with Pairwise Comparisons

arXiv:2406.12205v24.61 citations

Originality Highly original

AI Analysis

This work addresses the challenge of efficiently learning optimal policies from offline human feedback data for applications like AI alignment, offering a significant improvement over prior inverse polynomial convergence rates.

The paper tackles the problem of offline reinforcement learning from human feedback (RLHF) with pairwise comparisons, proposing an algorithm (RL-LOW) that achieves exponential convergence in simple regret, specifically exp(-Ω(n/H)), where n is the number of samples and H is an instance-dependent hardness parameter, and shows order-wise optimality by matching a derived lower bound.

We consider the problem of offline reinforcement learning from human feedback (RLHF) with pairwise comparisons proposed by Zhu et al. (2023), where the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective consists in ascertaining the optimal action for each state, with the ultimate goal of minimizing the {\em simple regret}. We propose an algorithm, \underline{RL} with \underline{L}ocally \underline{O}ptimal \underline{W}eights or {\sc RL-LOW}, which yields an exponential form of simple regret of $\exp ( - Ω(n/H) )$ where $n$ is the number of data samples and $H$ denotes an instance-dependent hardness quantity that depends explicitly on the suboptimality gap of each action. Furthermore, we derive a first-of-its-kind instance-dependent lower bound in offline RLHF with pairwise comparisons. Interestingly, we observe that the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating order-wise optimality of our {\sc RL-LOW}. In view of privacy considerations in practical applications, we also extend {\sc RL-LOW} to the setting of $(\varepsilon,δ)$-differential privacy and show, somewhat surprisingly, that the hardness parameter $H$ is unchanged in the asymptotic regime as $n$ tends to infinity; this underscores the inherent efficiency of {\sc RL-LOW} in terms of preserving the privacy of the observed rewards. Given our focus on establishing instance-dependent bounds of exponential convergence, our research fills the research gap in existing studies that concentrate on establishing worst-case regrets of {\em inverse polynomial convergence} (e.g., $\widetilde{O}(\frac{1}{\sqrt{n}})$) for offline RLHF with pairwise comparisons.

View on arXiv PDF

Similar