LG AI Jan 2

IRPM: Intergroup Relative Preference Modeling for Pointwise Generative Reward Models

Haonan Song, Qingchen Xie, Huan Zhu, Feng Xiao, Luxi Xing, Liu Kang, Fuzhen Li, Zhiyong Zheng, Feng Jiang, Ziheng Li, Kun Yan, Qingyi Si

arXiv:2601.00677v2 2.71 citations h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses efficiency issues in reinforcement learning from human feedback (RLHF) for AI alignment, but the advance is incremental because it builds on existing preference-learning paradigms.

The paper tackles the computational bottleneck of pairwise generative reward models (GRMs) in RLHF by proposing IRPM, a method that trains pointwise GRMs via intergroup comparisons. IRPM achieves state-of-the-art performance among pointwise GRMs on benchmarks such as RM-Bench and approaches pairwise GRM performance while needing only O(n) reward evaluations.

Generative Reward Models (GRMs) have demonstrated strong performance in reward modeling due to their interpretability and potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck in reinforcement learning from human feedback (RLHF): calibrating or aggregating preference signals over n candidates often incurs O(n^2) pairwise judgments. To address this issue, we propose Intergroup Relative Preference Modeling (IRPM), an RL-based method that extends the Bradley-Terry preference-learning paradigm via intergroup comparisons to train pointwise GRMs from pairwise preference data. IRPM derives a pointwise reward for each response by contrasting groups of chosen vs. rejected samples, which makes pointwise scores comparable across candidate sets and allows O(n) reward evaluation for a variable number of candidates during RL training, while preserving interpretability and scalability. Experiments show that IRPM achieves state-of-the-art performance among pointwise GRMs on RM-Bench, JudgeBench and RewardBench, and approaches the performance of leading pairwise GRMs. In addition, IRPM achieves substantial gains in post-training evaluations, demonstrating its effectiveness.
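
The cost argument and the intergroup idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the exact reward, so the function names (`bt_prob`, `intergroup_preference`, `pairwise_judgments`, `pointwise_judgments`) and the choice to average the Bradley-Terry probability over all chosen-vs-rejected score pairs are assumptions made purely for illustration.

```python
import math
from itertools import product


def bt_prob(score_a: float, score_b: float) -> float:
    """Standard Bradley-Terry probability that a is preferred over b."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def intergroup_preference(chosen_scores, rejected_scores) -> float:
    """Mean Bradley-Terry probability over every chosen-vs-rejected pair.

    Assumption: this averaging is an illustrative stand-in, not the
    reward defined in the paper.
    """
    pairs = list(product(chosen_scores, rejected_scores))
    return sum(bt_prob(c, r) for c, r in pairs) / len(pairs)


def pairwise_judgments(n: int) -> int:
    """Model calls needed to compare every pair of n candidates once."""
    return n * (n - 1) // 2


def pointwise_judgments(n: int) -> int:
    """Model calls needed to score each of n candidates once."""
    return n


if __name__ == "__main__":
    # Hypothetical pointwise scores for two groups of sampled judgments.
    chosen = [2.0, 3.0]
    rejected = [0.0, 1.0]
    print(intergroup_preference(chosen, rejected))
    print(pairwise_judgments(8), pointwise_judgments(8))
```

In this sketch the model is called once per candidate, and the cross-group comparison is plain arithmetic on scores that have already been computed, which is where the O(n) versus O(n^2) difference in judgment calls comes from.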
