LGJan 2
IRPM: Intergroup Relative Preference Modeling for Pointwise Generative Reward ModelsHaonan Song, Qingchen Xie, Huan Zhu et al.
Generative Reward Models (GRMs) have demonstrated strong performance in reward modeling, due to their interpretability and potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck in reinforcement learning from human feedback (RLHF), when calibrating or aggregating preference signals over n candidates, often incurring O(n^2) pairwise judgments. To address this issue, we propose Intergroup Relative Preference Modeling (IRPM), an RL-based method that extends the Bradley--Terry preference-learning paradigm via intergroup comparisons to train pointwise GRMs from pairwise preference data. IRPM derives pointwise reward for each response by contrasting groups of chosen vs. rejected samples, enabling pointwise scores comparable across candidate sets and O(n) reward evaluation for a variable number of candidates during RL training, while preserving interpretability and scalability. Experiments show that IRPM achieves state-of-the-art performance among pointwise GRMs on RM-Bench, JudgeBench and RewardBench, and approaches the performance of leading pairwise GRMs. In addition, IRPM achieves substantial gains in post-training evaluations, demonstrating its effectiveness.
CVFeb 10, 2020
From Anchor Generation to Distribution Alignment: Learning a Discriminative Embedding Space for Zero-Shot RecognitionFuzhen Li, Zhenfeng Zhu, Xingxing Zhang et al.
In zero-shot learning (ZSL), the samples to be classified are usually projected into side information templates such as attributes. However, the irregular distribution of templates makes classification results confused. To alleviate this issue, we propose a novel framework called Discriminative Anchor Generation and Distribution Alignment Model (DAGDA). Firstly, in order to rectify the distribution of original templates, a diffusion based graph convolutional network, which can explicitly model the interaction between class and side information, is proposed to produce discriminative anchors. Secondly, to further align the samples with the corresponding anchors in anchor space, which aims to refine the distribution in a fine-grained manner, we introduce a semantic relation regularization in anchor space. Following the way of inductive learning, our approach outperforms some existing state-of-the-art methods, on several benchmark datasets, for both conventional as well as generalized ZSL setting. Meanwhile, the ablation experiments strongly demonstrate the effectiveness of each component.