DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging
This work addresses the problem of aligning LLMs with domain-specific preferences, which matters to both researchers and practitioners, though the contribution is incremental because it builds on existing model merging techniques.
The paper tackles the challenge of costly preference data collection for reward models in RLHF by proposing DogeRM, which integrates domain-specific knowledge through model merging, resulting in enhanced performance across benchmarks.
Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the \textbf{Do}main knowled\textbf{ge} merged \textbf{R}eward \textbf{M}odel (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model through model merging. Our experiments demonstrate that DogeRM improves performance across different benchmarks, and we provide a detailed analysis of the effects of model merging, highlighting its potential for facilitating model alignment.
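
To make the idea of merging a general reward model with a domain-specific model concrete, here is a minimal sketch. The abstract does not specify the merging rule, so the sketch assumes plain linear interpolation of the parameters that both models share, with any parameter that exists only in the reward model (for example a scalar reward head) kept unchanged. The function `merge_weights`, the coefficient `lam`, and the parameter names are hypothetical and chosen only for illustration. Weights are represented as plain Python lists so the example has no dependencies.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_weights(reward_model: Weights, domain_model: Weights, lam: float) -> Weights:
    """Interpolate parameters shared by both models (assumed merging rule).

    merged = lam * reward_model + (1 - lam) * domain_model for shared
    parameters. Parameters found only in the reward model are copied as-is,
    and parameters found only in the domain model are dropped.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")

    merged: Weights = {}
    for name, rm_values in reward_model.items():
        dm_values = domain_model.get(name)
        if dm_values is None or len(dm_values) != len(rm_values):
            # No compatible counterpart in the domain model: keep the
            # reward model's parameter unchanged.
            merged[name] = list(rm_values)
        else:
            merged[name] = [
                lam * r + (1.0 - lam) * d for r, d in zip(rm_values, dm_values)
            ]
    return merged


# Hypothetical checkpoints with one shared layer and one model-specific head each.
general_rm = {"layer.0.weight": [1.0, 2.0], "reward_head.weight": [0.5]}
domain_sft = {"layer.0.weight": [3.0, 4.0], "lm_head.weight": [9.0]}

merged = merge_weights(general_rm, domain_sft, lam=0.5)
print(merged)
```

In this sketch `lam` controls how much of the general reward model is retained. How such a coefficient is chosen, and which parameters are actually merged in DogeRM, are details for the full paper rather than something the abstract states.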