CL · AI · Jun 23, 2025

A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models

arXiv:2506.18485v2 · 3 citations · h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses the problem of inefficient reinforcement finetuning for large reasoning models, offering an incremental improvement by leveraging in-context learning to align generation with optimization.

The paper tackles the inefficiency of Reinforcement Learning with Verifiable Rewards (RLVR) for Large Reasoning Models by introducing Motivation-enhanced Reinforcement Finetuning (MeRF), which injects reward specifications into prompts to enhance model awareness, resulting in substantial performance gains over the RLVR baseline.

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for Large Reasoning Models to tackle complex tasks. However, the current RLVR paradigm is still not efficient enough, as it works in a trial-and-error manner. To perform better, the model must explore the reward space by generating numerous responses and learning from fragmented reward signals, blind to the overall reward patterns. Fortunately, verifiable rewards make a natural-language description of the reward function possible, and LLMs have demonstrated strong in-context learning ability. This motivates us to explore whether Large Reasoning Models can benefit from a motivation for the task, i.e., awareness of the reward function, during the reinforcement finetuning process, as we humans sometimes do when learning. In this paper, we introduce Motivation-enhanced Reinforcement Finetuning (MeRF), an intuitive yet effective method that enhances reinforcement finetuning of LLMs by "telling LLMs the rules of the game". Specifically, MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation that makes the model aware of the optimization objective. This simple modification leverages the in-context learning ability of LLMs, aligning generation with optimization, thereby incentivizing the model to generate desired outputs through both inner motivation and external reward. Empirical evaluations demonstrate that MeRF achieves substantial performance gains over the RLVR baseline. Moreover, ablation studies show that MeRF performs better with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement finetuning.
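To make the idea of injecting the reward specification into the prompt concrete, here is a minimal Python sketch. It is not the authors' implementation: the scoring rules, the `<answer>` tag format, and the names `REWARD_SPEC`, `build_prompt`, and `verifiable_reward` are all hypothetical and chosen only for illustration. The one point it aims to show is that the same rule-based reward used to score rollouts is also described in natural language at the top of the prompt.

```python
import re

# Hypothetical natural-language description of the reward function.
# These rules are an assumption made for illustration, not taken from the paper.
REWARD_SPEC = (
    "Scoring rules:\n"
    "- +1 if the response contains exactly one <answer>...</answer> block.\n"
    "- +2 more if the content of that block equals the reference answer.\n"
    "- -1 if the response does not contain exactly one <answer>...</answer> block."
)


def build_prompt(question: str, with_motivation: bool = True) -> str:
    """Build the prompt, optionally prepending the reward specification."""
    parts = []
    if with_motivation:
        # In-context "motivation": tell the model how it will be scored.
        parts.append(REWARD_SPEC)
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


def verifiable_reward(response: str, reference: str) -> int:
    """Rule-based reward that mirrors REWARD_SPEC exactly."""
    answers = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if len(answers) != 1:
        return -1  # missing or duplicated answer block
    score = 1  # format reward
    if answers[0].strip() == reference.strip():
        score += 2  # correctness reward
    return score
```

In a reinforcement finetuning loop, rollouts would be sampled from `build_prompt(question)` and scored with `verifiable_reward`. Passing `with_motivation=False` gives the plain RLVR-style prompt for comparison. Keeping `REWARD_SPEC` and `verifiable_reward` in sync reflects the abstract's observation that consistency between the in-context motivation and the external reward matters.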

