LGGTMLFeb 26

Regularized Online RLHF with Generalized Bilinear Preferences

arXiv:2602.23116v24 citationsh-index: 2
AI Analysis

This work provides the first statistically efficient guarantee for online RLHF in high-dimensions, which is significant for researchers and practitioners working on preference learning in complex environments.

This paper addresses contextual online Reinforcement Learning from Human Feedback (RLHF) with generalized preferences, aiming to identify the Nash Equilibrium. The authors propose two algorithms: Greedy Sampling, which achieves polylogarithmic regret of \(\tilde{\mathcal{O}}(ηd^4 (\log T)^2)\), and Explore-Then-Commit, which achieves \(\mathrm{poly}(d)\)-free regret of \(\tilde{\mathcal{O}}(\sqrt{ηr T})\).

We consider the problem of contextual online RLHF with general preferences, where the goal is to identify the Nash Equilibrium. We adopt the Generalized Bilinear Preference Model (GBPM) to capture potentially intransitive preferences via low-rank, skew-symmetric matrices. We investigate general preference learning with any strongly convex regularizer and regularization strength $η^{-1}$, generalizing beyond prior work limited to reverse KL-regularization. Central to our analysis is proving that the dual gap of the greedy policy is bounded by the square of the estimation error, a result derived solely from strong convexity and the skew-symmetry of GBPM. Building on this insight and a feature diversity assumption, we establish two regret bounds via two simple algorithms: (1) Greedy Sampling achieves polylogarithmic, $e^{\mathcal{O}(η)}$-free regret $\tilde{\mathcal{O}}(ηd^4 (\log T)^2)$. (2) Explore-Then-Commit achieves $\mathrm{poly}(d)$-free regret $\tilde{\mathcal{O}}(\sqrt{ηr T})$ by exploiting the low-rank structure; this is the first statistically efficient guarantee for online RLHF in high-dimensions.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes