CL · AI · Oct 8, 2025

LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

Zecheng Tang, Baibei Ji, Quantong Qiu, Haitian Wang, Xiaobo Liang, Juntao Li, Min Zhang

arXiv:2510.06915v29.65 citations · h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses the need for reliable reward modeling in long-context applications such as LLM agents. It is an incremental improvement over existing short-context methods.

The authors tackled the problem of reward models failing in long-context scenarios by introducing Long-RewardBench for evaluation and a multi-stage training strategy to create robust Long-context Reward Models (LongRMs). Their 8B LongRM outperformed 70B-scale baselines and matched the proprietary Gemini 2.5 Pro model.

Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., in LLM agents, it becomes indispensable to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. Yet current RMs remain confined to short-context settings and primarily focus on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs are fragile in long-context scenarios and fail to maintain context-aware preference judgments. Motivated by an analysis of the failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust Long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
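The abstract names two task formats, Pairwise Comparison and Best-of-N, without giving their scoring rules. The sketch below shows one common way such tasks are scored, and it is not taken from the paper: it assumes the reward model can be wrapped as a scalar scorer over (context, response), whereas a generative RM would instead emit a textual judgment that needs parsing. The names `pairwise_accuracy`, `best_of_n_accuracy`, and `toy_overlap_scorer` are hypothetical, and the toy scorer is only a stand-in for a real model.

```python
from typing import Callable, Sequence, Tuple

# Assumption: a reward model exposed as a scalar scorer over (context, response).
Scorer = Callable[[str, str], float]


def pairwise_accuracy(
    scorer: Scorer, examples: Sequence[Tuple[str, str, str]]
) -> float:
    """Fraction of (context, chosen, rejected) triples where chosen scores higher."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = sum(
        scorer(ctx, chosen) > scorer(ctx, rejected)
        for ctx, chosen, rejected in examples
    )
    return correct / len(examples)


def best_of_n_accuracy(
    scorer: Scorer, examples: Sequence[Tuple[str, Sequence[str], int]]
) -> float:
    """Fraction of (context, candidates, best_index) items where the
    top-scored candidate is the labeled best one."""
    if not examples:
        raise ValueError("examples must be non-empty")
    hits = 0
    for ctx, candidates, best_index in examples:
        scores = [scorer(ctx, cand) for cand in candidates]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        hits += predicted == best_index
    return hits / len(examples)


def toy_overlap_scorer(context: str, response: str) -> float:
    """Toy stand-in for a reward model: share of response words found in the context."""
    context_words = set(context.lower().split())
    words = response.lower().split()
    if not words:
        return 0.0
    return sum(w in context_words for w in words) / len(words)


if __name__ == "__main__":
    context = "the meeting is on tuesday at noon"
    pairs = [(context, "the meeting is on tuesday", "the meeting is on friday")]
    bon = [(context, ["meeting on friday", "meeting on tuesday at noon", "no meeting"], 1)]
    print(pairwise_accuracy(toy_overlap_scorer, pairs))
    print(best_of_n_accuracy(toy_overlap_scorer, bon))
```

Both metrics reward a scorer only when its preference agrees with the label given the context, which is the context-aware behavior the abstract says current RMs lose on long inputs.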

