LG | Apr 8, 2025

Adversarial Training of Reward Models

NVIDIA
arXiv:2504.06141v2 | 12 citations | h-index: 29
Originality: Incremental advance
AI Analysis

This paper addresses a critical issue in the scalable alignment of language models for AI safety, though it is an incremental improvement over existing reward modeling methods.

The paper tackles the problem of reward models lacking robustness and being vulnerable to reward hacking by introducing Adv-RM, an adversarial training framework that identifies adversarial examples to improve RM robustness, resulting in significant performance gains in RLHF training.

Reward modeling has emerged as a promising approach for the scalable alignment of language models. However, contemporary reward models (RMs) often lack robustness, awarding high rewards to low-quality, out-of-distribution (OOD) samples. This can lead to reward hacking, where policies exploit unintended shortcuts to maximize rewards, undermining alignment. To address this challenge, we introduce Adv-RM, a novel adversarial training framework that automatically identifies adversarial examples -- responses that receive high rewards from the target RM but are OOD and of low quality. By leveraging reinforcement learning, Adv-RM trains a policy to generate adversarial examples that reliably expose vulnerabilities in large state-of-the-art reward models such as Nemotron 340B RM. Incorporating these adversarial examples into the reward training process improves the robustness of RMs, mitigating reward hacking and enhancing downstream performance in RLHF. We demonstrate that Adv-RM significantly outperforms conventional RM training, increasing stability and enabling more effective RLHF training in both synthetic and real-data settings.
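The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the `Candidate` fields, the `proxy_reward` quality signal, the thresholds, and the helper names are all hypothetical, and the abstract does not say how low quality or OOD status is actually detected. The sketch flags responses that the target RM scores highly but an assumed independent proxy scores poorly, pairs each one as the rejected response against a reference answer, and scores the pair with the standard Bradley-Terry pairwise loss commonly used for reward model training.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    prompt: str
    response: str
    target_reward: float  # score from the reward model under attack
    proxy_reward: float   # ASSUMPTION: score from some independent quality proxy


def is_adversarial(c, high=0.8, low=0.3):
    """Flag responses the target RM likes but the proxy does not.

    The thresholds are illustrative placeholders, not values from the paper.
    """
    return c.target_reward >= high and c.proxy_reward <= low


def build_pairs(candidates, references):
    """Turn flagged responses into preference pairs for RM training.

    `references` maps a prompt to a trusted response, used as "chosen";
    the adversarial response becomes "rejected".
    """
    return [
        {"prompt": c.prompt, "chosen": references[c.prompt], "rejected": c.response}
        for c in candidates
        if is_adversarial(c) and c.prompt in references
    ]


def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch, a reward model that still scores the adversarial response above the reference gets a large loss on that pair, which is the pressure that pushes it to stop rewarding such responses. In the framework the abstract describes, the candidates would come from a policy trained with reinforcement learning rather than from a fixed list.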

