CL · AI · May 30, 2025

Reasoning Models Hallucinate More: Factuality-Aware Reinforcement Learning for Large Reasoning Models

arXiv:2505.24630v2 · 11 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses the reliability of reasoning models in AI applications, but the advance is incremental because it builds on existing RL fine-tuning methods.

The paper tackles the problem that reinforcement learning fine-tuning for large language models increases hallucinations in reasoning tasks, and proposes a factuality-aware algorithm that reduces hallucinations and improves accuracy in experiments with models like Qwen2.5 and Llama.

Large language models (LLMs) have significantly advanced in reasoning tasks through reinforcement learning (RL) optimization, achieving impressive capabilities across various challenging benchmarks. However, our empirical analysis reveals a critical drawback: reasoning-oriented RL fine-tuning significantly increases the prevalence of hallucinations. We theoretically analyze the RL training dynamics, identifying high-variance gradient, entropy-induced randomness, and susceptibility to spurious local optima as key factors leading to hallucinations. To address this drawback, we propose Factuality-aware Step-wise Policy Optimization (FSPO), an innovative RL fine-tuning algorithm incorporating explicit factuality verification at each reasoning step. FSPO leverages automated verification against given evidence to dynamically adjust token-level advantage values, incentivizing factual correctness throughout the reasoning process. Experiments across mathematical reasoning and hallucination benchmarks using Qwen2.5 and Llama models demonstrate that FSPO effectively reduces hallucinations while enhancing reasoning accuracy, substantially improving both reliability and performance.
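The abstract says FSPO uses step-level factuality verification to adjust token-level advantage values, but it does not give the adjustment rule. The sketch below illustrates the general idea under stated assumptions: each response has one sequence-level advantage (as in group-relative RL methods), the response is already split into reasoning steps, and a verifier has labelled each step as factual, non-factual, or unverifiable. The function name `adjust_token_advantages`, its parameters, and the sign-based rule are hypothetical and are not taken from the paper.

```python
from typing import List, Optional


def adjust_token_advantages(
    base_advantage: float,
    step_token_counts: List[int],
    step_factual: List[Optional[bool]],
) -> List[float]:
    """Spread one sequence-level advantage over tokens, step by step.

    Assumed rule (illustrative only, not the paper's formula):
      - step verified as factual     -> its tokens get +|advantage|
      - step verified as non-factual -> its tokens get -|advantage|
      - step not verifiable (None)   -> its tokens keep the base advantage
    """
    if len(step_token_counts) != len(step_factual):
        raise ValueError("need exactly one factuality label per reasoning step")

    token_advantages: List[float] = []
    for n_tokens, factual in zip(step_token_counts, step_factual):
        if factual is True:
            step_advantage = abs(base_advantage)
        elif factual is False:
            step_advantage = -abs(base_advantage)
        else:
            step_advantage = base_advantage
        # Every token in a step shares that step's adjusted advantage.
        token_advantages.extend([step_advantage] * n_tokens)
    return token_advantages


# A response with a wrong final answer (negative advantage) and three steps:
# 2 tokens verified factual, 1 token verified non-factual, 3 tokens unverifiable.
example = adjust_token_advantages(-0.5, [2, 1, 3], [True, False, None])
print(example)
```

Under this assumed rule, a factual step is still rewarded inside a response whose final answer is wrong, and a non-factual step is penalized even when the final answer is right, which is one way to read "incentivizing factual correctness throughout the reasoning process".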

Code Implementations: 1 repo
