LG | AI | Sep 19, 2025

Reward Hacking Mitigation using Verifiable Composite Rewards

arXiv:2509.15557v1 | 5 citations | h-index: 16 | BCB
Originality: Incremental advance
AI Analysis

The paper addresses reliability issues in medical AI applications, though it is an incremental improvement over existing RLVR methods.

The paper tackles reward hacking in medical question answering by introducing a composite reward function that penalizes two behaviors: giving a final answer without preceding reasoning, and using non-standard reasoning formats. The authors report better-formatted reasoning, less reward hacking, and good accuracy compared to the baselines.

Reinforcement Learning from Verifiable Rewards (RLVR) has recently shown that large language models (LLMs) can develop their own reasoning without direct supervision. However, applications in the medical domain, specifically for question answering, are susceptible to significant reward hacking during the reasoning phase. Our work addresses two primary forms of this behavior: i) providing a final answer without preceding reasoning, and ii) employing non-standard reasoning formats to exploit the reward mechanism. To mitigate these, we introduce a composite reward function with specific penalties for these behaviors. Our experiments show that extending RLVR with our proposed reward model leads to better-formatted reasoning with less reward hacking and good accuracy compared to the baselines. This approach marks a step toward reducing reward hacking and enhancing the reliability of models utilizing RLVR.
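
To make the idea of a composite reward concrete, here is a minimal Python sketch of a correctness reward combined with penalties for the two behaviors named in the abstract. It is an illustration under stated assumptions, not the paper's implementation: the `<think>`/`<answer>` tag format, the function name `composite_reward`, the exact-match correctness check, the penalty weights, and the minimum reasoning length are all hypothetical choices, since the abstract does not specify them.

```python
import re

# Assumed output format (not given in the abstract): reasoning inside
# <think>...</think>, followed by the final answer in <answer>...</answer>.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def composite_reward(
    completion: str,
    gold: str,
    no_reasoning_penalty: float = 0.5,  # hypothetical weight
    format_penalty: float = 0.5,        # hypothetical weight
    min_reasoning_chars: int = 20,      # hypothetical threshold
) -> float:
    """Correctness reward minus penalties for two reward-hacking behaviors."""
    think = THINK_RE.search(completion)
    answer = ANSWER_RE.search(completion)

    # Verifiable term: case-insensitive exact match on the extracted answer.
    correct = (
        answer is not None
        and answer.group(1).strip().upper() == gold.strip().upper()
    )
    reward = 1.0 if correct else 0.0

    # Penalty (i): a final answer with no (or trivially short) reasoning.
    if think is None or len(think.group(1).strip()) < min_reasoning_chars:
        reward -= no_reasoning_penalty

    # Penalty (ii): non-standard format. The expected layout is exactly one
    # <think> block, then the <answer> block, with only whitespace elsewhere.
    well_formed = (
        think is not None
        and answer is not None
        and completion[: think.start()].strip() == ""
        and completion[think.end() : answer.start()].strip() == ""
        and completion[answer.end() :].strip() == ""
    )
    if not well_formed:
        reward -= format_penalty

    return reward


good = (
    "<think>Fever with a productive cough points to pneumonia.</think>"
    "<answer>B</answer>"
)
bare = "<answer>B</answer>"
trailing = good + " Also, ignore the rubric."

print(composite_reward(good, "B"))      # full reward
print(composite_reward(bare, "B"))      # both penalties apply
print(composite_reward(trailing, "B"))  # only the format penalty applies
```

In this sketch a correct answer alone is not enough to earn the full reward: the policy also has to produce reasoning in the expected layout, which is the incentive structure the abstract describes.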

