Fusing Rewards and Preferences in Reinforcement Learning
This work addresses the challenge of efficiently combining different feedback types for reinforcement learning agents, though the contribution is incremental: it builds on existing methods such as SAC and RLHF.
The paper tackles the problem of integrating both rewards and pairwise preferences in reinforcement learning by introducing the Dual-Feedback Actor (DFA) algorithm, which matches or exceeds Soft Actor-Critic on six control environments and outperforms RLHF baselines in a stochastic GridWorld.
We present Dual-Feedback Actor (DFA), a reinforcement learning algorithm that fuses individual rewards and pairwise preferences (if available) into a single update rule. DFA uses the policy's log-probabilities directly to model the preference probability, avoiding a separate reward-modeling step. Preferences can be provided by human annotators (at the state level or trajectory level) or synthesized online from Q-values stored in an off-policy replay buffer. Under a Bradley-Terry model, we prove that minimizing DFA's preference loss recovers the entropy-regularized Soft Actor-Critic (SAC) policy. Our simulation results show that DFA trained on generated preferences matches or exceeds SAC on six control environments and trains more stably. With only a semi-synthetic preference dataset under the Bradley-Terry model, our algorithm outperforms reward-modeling reinforcement learning from human feedback (RLHF) baselines in a stochastic GridWorld and approaches the performance of an oracle with true rewards.
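To make the idea of modeling preference probabilities with policy log-probabilities concrete, the following is a minimal sketch and not the paper's exact update rule. It assumes a Bradley-Terry model whose scores are the policy's log-probabilities scaled by a hypothetical temperature `beta`, and it labels synthetic preferences by simply comparing Q-values, which is a simplification chosen for illustration. All function names and parameters here are assumptions.

```python
import numpy as np

def bt_preference_prob(logp_a, logp_b, beta=1.0):
    """P(a preferred over b) under Bradley-Terry, using beta * log pi as
    the score of each option (beta is an assumed temperature)."""
    diff = beta * (np.asarray(logp_a, dtype=float) - np.asarray(logp_b, dtype=float))
    return 1.0 / (1.0 + np.exp(-diff))

def preference_loss(logp_a, logp_b, pref_a, beta=1.0):
    """Mean negative log-likelihood of the preference labels.
    pref_a is 1.0 where option a is preferred and 0.0 where b is."""
    diff = beta * (np.asarray(logp_a, dtype=float) - np.asarray(logp_b, dtype=float))
    # Stable log-sigmoid: log sigmoid(x) = -logaddexp(0, -x)
    log_p_a = -np.logaddexp(0.0, -diff)
    log_p_b = -np.logaddexp(0.0, diff)
    pref_a = np.asarray(pref_a, dtype=float)
    return float(-np.mean(pref_a * log_p_a + (1.0 - pref_a) * log_p_b))

def synthesize_preferences(q_a, q_b):
    """Assumed labeling rule: prefer the option with the higher Q-value.
    The paper may use a stochastic rule instead."""
    return (np.asarray(q_a) > np.asarray(q_b)).astype(float)

# Toy usage: two action pairs with Q-values taken from a replay buffer.
labels = synthesize_preferences(q_a=[1.0, 0.2], q_b=[0.5, 0.9])
loss = preference_loss(logp_a=[-0.5, -2.0], logp_b=[-1.5, -0.3], pref_a=labels)
```

Because the loss depends only on the policy's log-probabilities, its gradient flows straight into the policy parameters, which is what lets this kind of objective skip a separately trained reward model.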