CL, AI | Aug 8, 2025

Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future

Yidong Wang, Xin Wang, Cunxiang Wang, Junfeng Fang, Qiufeng Wang, Jianing Chu, Xuran Meng, Shuxun Yang, Libo Qin, Yue Zhang, Wei Ye, Shikun Zhang

arXiv:2508.06026v1 | 8.33 citations | h-index: 26

Originality: Highly original

AI Analysis

This paper addresses a critical limitation in iterative self-improvement for large language models and offers a novel method to sustain preference learning, though it remains incremental relative to existing self-rewarding paradigms.

The paper tackles the problem of diminishing learning signals in Self-Rewarding Language Models by proposing Temporal Self-Rewarding Language Models, which decouple chosen and rejected responses using past and future model generations, resulting in a 9.75-point win rate improvement on AlpacaEval 2.0 for Llama3.1-8B.

Self-Rewarding Language Models propose an architecture in which a Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose Temporal Self-Rewarding Language Models, which strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces: (1) Anchored Rejection, which fixes rejected responses using the outputs of the past initial model, and (2) Future-Guided Chosen, which dynamically curates chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama3B/8B/70B) demonstrate significant improvements when trained with our method compared to Self-Rewarding using the same computational resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.
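
The two phases described in the abstract can be illustrated with a minimal Python sketch of how one preference pair might be assembled. The abstract does not give the selection rules, so everything here is an assumption made for illustration: the names `PreferencePair`, `build_temporal_pair`, and `toy_judge` are hypothetical, the rejected response is taken as the lowest-scored output of the initial model, and the chosen response is taken as the highest-scored output among the current and next-generation models. The length-based judge is only a stand-in for LLM-as-a-Judge scoring.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    """One (prompt, chosen, rejected) triple of the kind DPO trains on."""
    prompt: str
    chosen: str
    rejected: str


def build_temporal_pair(
    prompt: str,
    past_responses: List[str],
    current_responses: List[str],
    future_responses: List[str],
    judge: Callable[[str, str], float],
) -> PreferencePair:
    """Assemble one preference pair (illustrative reading of the abstract)."""
    # Anchored Rejection (assumed rule): the rejected sample always comes
    # from the past initial model, here its lowest-scored response.
    rejected = min(past_responses, key=lambda r: judge(prompt, r))

    # Future-Guided Chosen (assumed rule): the chosen sample is the
    # best-scored response among current and next-generation outputs.
    candidates = current_responses + future_responses
    chosen = max(candidates, key=lambda r: judge(prompt, r))

    return PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected)


def toy_judge(prompt: str, response: str) -> float:
    """Hypothetical stand-in for an LLM judge: longer responses score higher."""
    return float(len(response))


pair = build_temporal_pair(
    prompt="Explain DPO briefly.",
    past_responses=["ok", "fine"],
    current_responses=["better answer"],
    future_responses=["a much better answer"],
    judge=toy_judge,
)
print(pair)
```

Because the rejected side stays tied to the initial model while the chosen side tracks newer generations, the quality gap between the two does not shrink as the policy improves, which is the effect the abstract attributes to decoupling chosen and rejected responses.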
