CL · Jul 4, 2024

HAF-RM: A Hybrid Alignment Framework for Reward Model Training

arXiv:2407.04185v4 · 5 citations · h-index: 20
Originality: Incremental advance
AI Analysis

This work addresses the need for better reward models in LLM alignment, offering an incremental improvement over existing methods.

The paper proposes a hybrid alignment framework for training reward models for large language models. It adds a token-level constraint on policy probabilities to conventional reward optimization and reports improved performance and alignment across five datasets.

The reward model has become increasingly important in alignment, assessment, and data construction for large language models (LLMs). Most existing research focuses on enhancing reward models through data improvements and follows the conventional training framework, which directly optimizes the predicted rewards. In this paper, we propose HaF-RM, a hybrid alignment framework for reward model training that adds a constraint on token-level policy probabilities alongside the reward score. It simultaneously supervises the internal preference model at the token level and optimizes the mapping layer of the reward model at the sequence level. Experimental results on five datasets demonstrate the validity and effectiveness of the proposed hybrid framework for training a high-quality reward model. By decoupling the reward modeling procedure and incorporating hybrid supervision, HaF-RM offers a principled and effective approach to improving the performance and alignment of reward models, a critical component in the responsible development of powerful language models. We release our code at https://haf-rm.github.io.
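The abstract does not give the exact form of the two objectives, so the following is only a minimal sketch of how a sequence-level reward term and a token-level policy term might be combined for one preference pair. It assumes a Bradley-Terry pairwise loss for the reward score and a DPO-style loss over summed token log-probabilities for the policy constraint; the function names and the `alpha` and `beta` weights are hypothetical and are not taken from the paper or its released code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Sequence-level term (assumed Bradley-Terry): push the chosen reward above the rejected one."""
    return -log_sigmoid(r_chosen - r_rejected)


def policy_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Token-level term (assumed DPO-style).

    Each argument is a list of per-token log-probabilities for one response,
    under the trained model (logp_*) or a frozen reference model (ref_*).
    """
    chosen_gain = sum(logp_chosen) - sum(ref_chosen)
    rejected_gain = sum(logp_rejected) - sum(ref_rejected)
    return -log_sigmoid(beta * (chosen_gain - rejected_gain))


def hybrid_loss(r_chosen, r_rejected,
                logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                alpha=1.0, beta=0.1):
    """Hypothetical hybrid objective: reward term plus alpha times the policy term."""
    return (reward_loss(r_chosen, r_rejected)
            + alpha * policy_loss(logp_chosen, logp_rejected,
                                  ref_chosen, ref_rejected, beta))


if __name__ == "__main__":
    # Toy preference pair: the chosen response has a higher reward and
    # gains more log-probability relative to the reference than the rejected one.
    loss = hybrid_loss(
        r_chosen=1.2, r_rejected=0.3,
        logp_chosen=[-0.5, -0.4, -0.6], logp_rejected=[-1.0, -1.2],
        ref_chosen=[-0.7, -0.6, -0.8], ref_rejected=[-0.9, -1.0],
    )
    print(f"hybrid loss: {loss:.4f}")
```

In this sketch both terms would share a backbone in a real model: the reward scores come from a sequence-level mapping layer, while the per-token log-probabilities come from a language-modeling head. Setting `alpha` to 0 recovers plain reward-only training under the same assumptions.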

