LG · CL · Sep 18, 2025

TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference

arXiv:2509.15110v2 · 5 citations · h-index: 13 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in reward modeling for language models, offering incremental improvements in data efficiency and policy quality for RL and inference tasks.

The paper tackles temporal inconsistency in reward models used for language model reinforcement learning and inference. It introduces TDRM, which learns smoother reward models by minimizing temporal differences. Reported improvements reach up to 6.6% in Best-of-N and 23.7% in tree-search settings, and RL with TDRM achieves comparable performance with 2.5k data versus 50.1k for baselines.

Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, leading to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward models by minimizing temporal differences (TD) for training-time reinforcement learning and inference-time verification. Experiments show that TD-trained process reward models (PRMs) improve performance across Best-of-N (up to 6.6%) and tree-search (up to 23.7%) settings. When combined with Reinforcement Learning with Verifiable Rewards (RLVR), TD-trained PRMs lead to more data-efficient RL -- achieving comparable performance with just 2.5k data to what baseline methods require 50.1k data to attain -- and yield higher-quality language model policies in 8 model variants (5 series), e.g., Qwen2.5-(0.5B, 1.5B), GLM4-9B-0414, GLM-Z1-9B-0414, Qwen2.5-Math-(1.5B, 7B), and DeepSeek-R1-Distill-Qwen-(1.5B, 7B). We release all code at https://github.com/THUDM/TDRM.
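
To make "minimizing temporal differences" concrete, the following is a minimal sketch of a generic one-step TD objective over the per-step scores of a single reasoning trajectory. It is not taken from the TDRM code and may differ from the paper's actual formulation: the function names `td_targets` and `td_loss`, the zero intermediate reward, the squared-error loss, and the discount `gamma` are all assumptions made for illustration.

```python
from typing import List


def td_targets(values: List[float], final_reward: float, gamma: float = 1.0) -> List[float]:
    """One-step TD targets for the step-wise scores of one trajectory.

    values[t] is the reward model's score after reasoning step t.
    Assumption: intermediate rewards are zero, so the target for step t is
    gamma * values[t + 1]; the last step's target is the verifiable outcome
    reward (e.g. 1.0 for a correct final answer, 0.0 otherwise).
    """
    if not values:
        return []
    targets = [gamma * values[t + 1] for t in range(len(values) - 1)]
    targets.append(final_reward)
    return targets


def td_loss(values: List[float], final_reward: float, gamma: float = 1.0) -> float:
    """Mean squared TD error; small when consecutive scores change smoothly."""
    if not values:
        return 0.0
    targets = td_targets(values, final_reward, gamma)
    return sum((v - y) ** 2 for v, y in zip(values, targets)) / len(values)


# A jagged score sequence is penalized more than a smooth one,
# even though both end at the same final score.
jagged = [0.2, 0.9, 0.3, 1.0]
smooth = [0.7, 0.8, 0.9, 1.0]
print(td_loss(jagged, final_reward=1.0) > td_loss(smooth, final_reward=1.0))
```

In a real training loop the targets would normally be treated as constants (no gradient flows through them) and the scores would come from a neural PRM; both details are omitted here to keep the sketch dependency-free.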

