CL · May 5, 2025

Sailing by the Stars: A Survey on Reward Models and Learning Strategies for Learning from Rewards

arXiv:2505.02686v2 · 15.57 citations · h-index: 2 · Has Code

Originality: Synthesis-oriented

AI Analysis

The paper synthesizes existing research on reward models and learning strategies, making it a useful reference for AI researchers and practitioners. As a survey, its contribution is incremental rather than novel.

This survey provides a comprehensive overview of the Learning from Rewards paradigm in Large Language Models, which uses reward signals to steer behavior and enables active learning from dynamic feedback for improved alignment and reasoning.

Recent developments in Large Language Models (LLMs) have shifted from pre-training scaling to post-training and test-time scaling. Across these developments, a key unified paradigm has arisen: Learning from Rewards, where reward signals act as the guiding stars to steer LLM behavior. It has underpinned a wide range of prevalent techniques, such as reinforcement learning (RLHF, RLAIF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, this paradigm enables the transition from passive learning from static data to active learning from dynamic feedback. This endows LLMs with aligned preferences and deep reasoning capabilities for diverse tasks. In this survey, we present a comprehensive overview of learning from rewards, from the perspective of reward models and learning strategies across training, inference, and post-inference stages. We further discuss the benchmarks for reward models and the primary applications. Finally, we highlight the challenges and future directions. We maintain a paper collection at https://github.com/bobxwu/learning-from-rewards-llm-papers.

