AI · Oct 2, 2025

Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning

Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar

arXiv:2510.01857v1 · 5.81 citations · h-index: 74

Originality: Incremental advance

AI Analysis

This work addresses the challenge of eliciting correct reasoning processes in language models, offering a reusable reward mechanism that could broadly enhance reasoning tasks, though it is incremental in applying existing IRL methods to this domain.

The paper addresses multi-step reasoning in large language models by learning a dense, token-level reward model from expert demonstrations via inverse reinforcement learning. The learned reward improves predictive performance through reranking and helps localise errors within a reasoning trace, with the most notable gains for Llama-based policies on GSM8K.

We reframe adversarial inverse reinforcement learning (IRL) for large language model reasoning and operationalise it, learning a dense, token-level reward model for process supervision directly from expert demonstrations rather than imitating style via supervised fine-tuning. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets. We demonstrate that our approach prioritises correctness over surface form, yielding scores that correlate with eventual answer validity and enabling interpretable localisation of errors within a trace. Empirically, on GSM8K with Llama3 and Qwen2.5 backbones, we demonstrate that: (i) dense reasoning rewards can be used as a learning signal to elicit reasoning, and (ii) reward-guided reranking improves predictive performance (notably for Llama-based policies). By unifying training signals, inference-time selection, and token-level diagnostics into a single reasoning reward, this work suggests reusable process-level rewards with broad potential to enhance multi-step reasoning in language models.
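The inference-time role described in the abstract (reranking sampled traces and localising errors with a token-level reward) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `ScoredTrace` container, the hard-coded reward values, and in particular the choice to aggregate token rewards by their mean and to flag the lowest-reward token as the likely error.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class ScoredTrace:
    """Hypothetical container: a sampled trace plus one reward per token."""
    tokens: List[str]
    token_rewards: List[float]  # assumed outputs of a learned dense reward model


def trace_score(trace: ScoredTrace) -> float:
    # Assumption: a trace-level score is the mean of its token rewards.
    return mean(trace.token_rewards)


def rerank(traces: List[ScoredTrace]) -> List[ScoredTrace]:
    # Best-of-N selection: sort sampled traces from highest to lowest score.
    return sorted(traces, key=trace_score, reverse=True)


def weakest_token(trace: ScoredTrace) -> Tuple[int, str]:
    # Error localisation heuristic: return the position and text of the
    # token with the lowest reward.
    idx = min(range(len(trace.token_rewards)),
              key=trace.token_rewards.__getitem__)
    return idx, trace.tokens[idx]


# Toy data with made-up rewards (not results from the paper).
good = ScoredTrace(["2", "+", "3", "=", "5"], [0.9, 0.8, 0.9, 0.7, 0.95])
bad = ScoredTrace(["2", "+", "3", "=", "6"], [0.9, 0.8, 0.9, 0.7, -0.6])

best = rerank([bad, good])[0]
error_position, error_token = weakest_token(bad)
```

In this toy setup `best` is the trace ending in the correct sum, and `weakest_token(bad)` points at the final, incorrect token. Other aggregations (sum, minimum, last-token reward) would be equally plausible choices for the trace-level score.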
