Curiosity-Driven Reinforcement Learning from Human Feedback
The work addresses the reduced output diversity of aligned LLMs, a problem for users who need varied outputs, and represents an incremental improvement over standard RLHF.
The paper tackles the trade-off between output diversity and alignment quality in reinforcement learning from human feedback (RLHF) for large language models. It introduces curiosity-driven RLHF (CD-RLHF), which achieves significant gains on multiple diversity metrics while keeping alignment comparable to standard RLHF.
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We make our code publicly available at https://github.com/ernie-research/CD-RLHF.
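The idea of combining intrinsic rewards for novel states with a sparse extrinsic reward can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that novelty is measured as the prediction error of a forward model over state features (a common curiosity formulation), that the extrinsic reward arrives only at the final token, and that a hypothetical weight `eta` scales the intrinsic term. The function names and toy values are illustrative only.

```python
from typing import List, Sequence


def prediction_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Squared L2 distance between predicted and observed next-state features.

    Assumption: a large error is treated as a sign that the state is novel.
    """
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))


def combine_rewards(
    intrinsic: Sequence[float],
    extrinsic_final: float,
    eta: float = 0.1,  # hypothetical weight on the intrinsic term
) -> List[float]:
    """Build per-token rewards: eta * intrinsic at every token, plus the
    sparse extrinsic reward added only at the last token."""
    if not intrinsic:
        return []
    rewards = [eta * r for r in intrinsic]
    rewards[-1] += extrinsic_final
    return rewards


# Toy example: three tokens with two-dimensional state features.
predicted_states = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
actual_states = [[0.0, 0.0], [1.0, 3.0], [0.5, 1.5]]

intrinsic = [
    prediction_error(p, a) for p, a in zip(predicted_states, actual_states)
]
rewards = combine_rewards(intrinsic, extrinsic_final=2.0, eta=0.5)
print(rewards)  # [0.0, 2.0, 2.5]
```

In this toy run the second token is the most surprising, so it receives the largest intrinsic bonus, while the reward-model score is added only at the final token. A policy-gradient method such as PPO would then optimize these per-token rewards in place of the purely sparse signal.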