TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference
This work targets one bottleneck in reward modeling for language models, the lack of temporal consistency, and reports incremental gains in data efficiency and policy quality for RL and inference.
The paper tackles temporal inconsistency in reward models used for language model reinforcement learning and inference. It introduces TDRM, which learns smoother reward models by minimizing temporal differences. Reported improvements reach 6.6% in Best-of-N and 23.7% in tree-search settings, and the method achieves comparable performance with 2.5k data where baselines require 50.1k.
Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, leading to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward models by minimizing temporal differences (TD) for training-time reinforcement learning and inference-time verification. Experiments show that TD-trained process reward models (PRMs) improve performance across Best-of-N (up to 6.6%) and tree-search (up to 23.7%) settings. When combined with Reinforcement Learning with Verifiable Rewards (RLVR), TD-trained PRMs lead to more data-efficient RL, matching with just 2.5k data the performance that baseline methods need 50.1k data to attain, and yield higher-quality language model policies in 8 model variants (5 series), e.g., Qwen2.5-(0.5B, 1.5B), GLM4-9B-0414, GLM-Z1-9B-0414, Qwen2.5-Math-(1.5B, 7B), and DeepSeek-R1-Distill-Qwen-(1.5B, 7B). We release all code at https://github.com/THUDM/TDRM.
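To make the idea of "minimizing temporal differences" concrete, the following is a minimal sketch of a generic one-step TD objective over the per-step scores of a process reward model. It is not taken from the TDRM code: the function names `td_targets` and `td_loss`, the discount `gamma`, the squared-error loss, and the choice to bootstrap from zero after the final step are all illustrative assumptions, and the paper's actual objective may differ.

```python
from typing import List


def td_targets(values: List[float], rewards: List[float], gamma: float = 1.0) -> List[float]:
    """One-step TD targets r_t + gamma * V(s_{t+1}) for each step.

    Assumption: the value after the final step is 0, so the last
    target is just the final reward.
    """
    if len(values) != len(rewards):
        raise ValueError("values and rewards must have the same length")
    targets = []
    for t in range(len(values)):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        targets.append(rewards[t] + gamma * next_value)
    return targets


def td_loss(values: List[float], rewards: List[float], gamma: float = 1.0) -> float:
    """Mean squared TD error between each step's value and its TD target."""
    if not values:
        raise ValueError("need at least one step")
    targets = td_targets(values, rewards, gamma)
    return sum((v - y) ** 2 for v, y in zip(values, targets)) / len(values)


# Hypothetical per-step PRM scores for a three-step solution whose
# final answer is verified as correct (reward 1 at the last step).
step_values = [0.2, 0.5, 0.9]
step_rewards = [0.0, 0.0, 1.0]
loss = td_loss(step_values, step_rewards)
```

In this sketch, a reward model whose score jumps sharply between adjacent steps incurs a large TD error, so driving the loss down pushes neighboring step scores toward agreement, which is one way to read the "smoother" reward models described above.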