LG, AI, CL | Dec 31, 2024

Reinforcing Thinking through Reasoning-Enhanced Reward Models

arXiv:2501.01457v1 | 5 citations | h-index: 7
Originality: Incremental advance
AI Analysis

This paper addresses a key limitation in LLM reasoning, though the advance is incremental because it builds on existing self-critique methods.

The paper tackles the difficulty LLMs have in deciding when to stop thinking during multi-step reasoning. It proposes the DRR framework, which distills the model's reasoning processes into synthetic data for training a reward model. The resulting system outperforms self-critique approaches on benchmarks without manual labeling.

Large Language Models (LLMs) show great potential for complex multi-step reasoning through inference-time thinking, but they still struggle to decide when to stop thinking because they have limited awareness of their own knowledge boundaries. Human preference alignment has shown great promise, but its expensive labeling makes it difficult to follow scaling laws. Language model self-critique, an alternative to human-labeled reasoning data, is questionable because it inherits the model's own biases. This work addresses these challenges by distilling the LLM's own reasoning processes into synthetic behavioral data, eliminating the need for manual labeling of intermediate steps. Building on this concept, we propose Distillation-Reinforcement-Reasoning (DRR), a three-step framework that leverages the LLM's inherent behaviors as external feedback: it first generates behavioral data with the Reasoner (LLM) to reflect its reasoning capabilities, then trains a lightweight discriminative reward model (DM) on that behavioral data, and finally deploys the DM at inference time to assist the Reasoner's decision-making. Experiments on multiple benchmarks show that the DRR framework outperforms self-critique approaches without relying on additional complex data annotation. Thanks to its lightweight design, ease of replication, and adaptability, DRR is applicable to a wide range of LLM-centric tasks.
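
The three steps described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: `toy_reasoner`, the scalar `feature`, the 0.6 base accuracy, the logistic-regression DM, and the retry budget `max_turns` are all hypothetical stand-ins chosen for illustration. The sketch also assumes that behavioral data is labeled by comparing the Reasoner's final answer with a known reference answer, which the abstract does not spell out.

```python
import math
import random

def toy_reasoner(question, rng):
    """Hypothetical stand-in for the LLM Reasoner.

    Returns (answer, feature). The scalar feature is an assumed proxy for
    whatever representation of the reasoning trace a real DM would consume.
    """
    a, b = question
    correct = rng.random() < 0.6                      # assumed base accuracy
    answer = a + b if correct else a + b + rng.choice([-1, 1])
    feature = (1.0 if correct else 0.0) + rng.gauss(0.0, 0.3)
    return answer, feature

def generate_behavioral_data(questions, rng):
    """Step 1: run the Reasoner and label each attempt by its final answer."""
    data = []
    for q in questions:
        answer, feature = toy_reasoner(q, rng)
        data.append((feature, 1 if answer == sum(q) else 0))
    return data

def sigmoid(z):
    z = max(-30.0, min(30.0, z))                      # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

def train_dm(data, epochs=50, lr=0.1):
    """Step 2: fit a one-feature logistic regression as a lightweight DM."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = sigmoid(w * x + b) - y              # log-loss gradient w.r.t. the logit
            w -= lr * err * x
            b -= lr * err
    return w, b

def answer_with_dm(question, dm, rng, max_turns=4):
    """Step 3: retry until the DM accepts an attempt or the budget runs out."""
    w, b = dm
    answer = None
    for _ in range(max_turns):
        answer, feature = toy_reasoner(question, rng)
        if sigmoid(w * feature + b) >= 0.5:           # DM says "stop thinking"
            break
    return answer

def run_demo(seed=0, n_train=300, n_test=500):
    """Compare single-shot accuracy with DM-guided accuracy on toy questions."""
    rng = random.Random(seed)
    make_q = lambda: (rng.randint(0, 99), rng.randint(0, 99))
    train_qs = [make_q() for _ in range(n_train)]
    dm = train_dm(generate_behavioral_data(train_qs, rng))
    test_qs = [make_q() for _ in range(n_test)]
    single = sum(toy_reasoner(q, rng)[0] == sum(q) for q in test_qs) / n_test
    guided = sum(answer_with_dm(q, dm, rng) == sum(q) for q in test_qs) / n_test
    return single, guided
```

Calling `run_demo()` returns the single-shot and DM-guided accuracies on the toy task. In this sketch the DM never sees the reference answer at inference time; it only decides whether to accept the current attempt or ask the Reasoner to try again, which is the "when to stop thinking" decision the abstract describes.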

