Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret
This work addresses the challenge of handling inconsistent human feedback in RLHF systems, which is crucial for practical applications, although its theoretical analysis is incremental.
The paper tackles the problem of reinforcement learning from multi-source imperfect preferences, where feedback comes from diverse sources with systematic mismatches, and proposes an algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$, achieving statistical gains when imperfection is small and robustness when it is large, complemented by a matching lower bound.
Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically \emph{multi-source} (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from \emph{multi-source imperfect preferences} through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most $\omega$ over $K$ episodes. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$, which exhibits a best-of-both-regimes behavior: it achieves $M$-dependent statistical gains when imperfection is small (where $M$ is the number of sources), while remaining robust with unavoidable additive dependence on $\omega$ when imperfection is large. We complement this with a lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, which captures the best possible improvement with respect to $M$ and the unavoidable dependence on $\omega$, and a counterexample showing that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as $\tilde{\Omega}(\min\{\omega\sqrt{K},K\})$. Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.
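
To make the two regimes concrete, the following minimal Python sketch evaluates the three rates stated in the abstract for a few values of the imperfection budget. It is only an illustration of the orders of growth, not the paper's algorithm: constants and logarithmic factors are dropped, and the function names, the parameter name `omega` (standing for the cumulative imperfection budget), and the sample values of `K` and `M` are assumptions chosen for this example.

```python
import math


def upper_bound(K: int, M: int, omega: float) -> float:
    """Order of the stated regret upper bound: sqrt(K/M) + omega (log factors dropped)."""
    return math.sqrt(K / M) + omega


def lower_bound(K: int, M: int, omega: float) -> float:
    """Order of the stated lower bound: max(sqrt(K/M), omega) (log factors dropped)."""
    return max(math.sqrt(K / M), omega)


def naive_regret(K: int, omega: float) -> float:
    """Order of the counterexample rate for treating imperfect feedback
    as oracle-consistent: min(omega * sqrt(K), K)."""
    return min(omega * math.sqrt(K), K)


if __name__ == "__main__":
    K, M = 10_000, 4  # hypothetical number of episodes and number of sources
    for omega in (0.0, 10.0, 50.0, 500.0):
        print(
            f"omega={omega:6.1f}  "
            f"upper~{upper_bound(K, M, omega):8.1f}  "
            f"lower~{lower_bound(K, M, omega):8.1f}  "
            f"naive~{naive_regret(K, omega):8.1f}"
        )
```

With a small budget the `sqrt(K/M)` term dominates, so additional sources help. With a large budget the additive budget term dominates, so additional sources no longer change the order of the bound. Because a sum of two nonnegative terms is at most twice their maximum, the upper and lower rates in this sketch always agree up to a factor of 2, which is the sense in which the bounds match.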