LG · IR · May 27

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

Youting Wang, Yuan Tang, Bowen Liu, Xuan Liu, Dingyan Shang

arXiv:2605.28918 · 60.9

Predicted impact: top 36% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For researchers using LLMs to design reward functions in sparse structured RL, this work provides a debugging framework that substantially improves success rates over one-shot generation on MiniGrid tasks.

LLM-generated reward shaping for sparse structured RL tasks often fails due to reward flooding and semantic/API misunderstandings. The proposed diagnostic-driven iterative refinement improves success rates from 2.3% to 97.6% on DoorKey-8x8 and from 31.2% to 86.7% on KeyCorridor. Controls indicate that the gains come largely from the taxonomy prompt rather than from retrying or extra training.

For sparse, structured reinforcement-learning tasks with semantic reward-function interfaces, LLM-generated reward shaping is better framed as debugging than as one-shot generation. We study PPO-trained agents using MiniGrid as the core evaluation and MuJoCo as a boundary stress test. Our audit finds two dominant one-shot failure modes -- reward flooding and semantic/API misunderstanding -- plus a rarer weak-shaping case. We propose diagnostic-driven iterative refinement, where training diagnostics and a failure-mode taxonomy guide targeted reward-function revision. Refinement improves DoorKey-8x8 from 2.3% to 97.6% and KeyCorridor from 31.2% to 86.7%, with high seed-to-seed variance. Controls show these gains are not from retrying or extra training: metrics-only re-prompting yields large drops, while a static-vocabulary control recovers much of the gap (87.6%; 70.7%), showing that the taxonomy prompt is a major mechanism and that dynamic labels provide only partially isolated incremental evidence. Budget-matched and Best-of-3 comparisons separate refinement from selection and training-time effects. Component-removal tests, sensitivity analyses, and an audit against author labels provide converging evidence for the debugging interpretation while revealing calibration limits. Continuous-control results show the boundary: success-based diagnostics can misfire in dense-reward locomotion, and return-trend feedback removes one false-positive mechanism without robust gains. The low-call protocol is a cost contrast with population-based reward search, not a benchmark comparison. In four crossed-variance-design environments, point estimates suggest larger gains when LLM reward-function variance dominates, but bootstrap intervals are wide. The method is bounded to sparse structured tasks with reliable interfaces under PPO; fields like event_text may help, hurt, or be neutral.
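
The refinement loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Diagnostics` fields, the threshold values, the `classify_failure` rules, and the feedback string format are all hypothetical and chosen for illustration. Only the three failure-mode names (reward flooding, semantic/API misunderstanding, weak shaping) come from the abstract. The `generate` and `train_and_diagnose` callables stand in for an LLM call and a PPO training run, which are left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Diagnostics:
    """Hypothetical training diagnostics; field names are assumptions."""
    success_rate: float   # fraction of evaluation episodes that reach the goal
    shaped_return: float  # mean shaped reward collected per episode
    runtime_error: bool   # reward function raised or misused the interface


def classify_failure(
    d: Diagnostics,
    success_target: float = 0.8,    # assumed threshold, not from the paper
    flood_threshold: float = 10.0,  # assumed threshold, not from the paper
    weak_threshold: float = 0.1,    # assumed threshold, not from the paper
) -> Optional[str]:
    """Map diagnostics to a failure-mode label, or None if training succeeded."""
    if d.success_rate >= success_target:
        return None
    if d.runtime_error:
        return "semantic_api_misunderstanding"
    if d.shaped_return >= flood_threshold:
        # Large shaped return without task success suggests reward flooding.
        return "reward_flooding"
    if d.shaped_return <= weak_threshold:
        return "weak_shaping"
    return "unclassified"


History = List[Tuple[str, Diagnostics, Optional[str]]]


def refine(
    generate: Callable[[Optional[str]], str],
    train_and_diagnose: Callable[[str], Diagnostics],
    max_rounds: int = 3,
) -> Tuple[str, Diagnostics, History]:
    """Generate, train, diagnose, and re-prompt with the failure label."""
    feedback: Optional[str] = None
    history: History = []
    for _ in range(max_rounds):
        reward_src = generate(feedback)
        diag = train_and_diagnose(reward_src)
        label = classify_failure(diag)
        history.append((reward_src, diag, label))
        if label is None:
            break
        # The taxonomy label is sent back together with the raw metrics.
        feedback = (
            f"failure_mode={label}; "
            f"success_rate={diag.success_rate:.3f}; "
            f"shaped_return={diag.shaped_return:.2f}"
        )
    # Return the last candidate rather than the best one, so the loop
    # does not mix refinement with selection.
    return reward_src, diag, history
```

Returning the final candidate instead of the best-scoring one is a deliberate choice in this sketch: the abstract treats selection (Best-of-3) as a separate control, so a loop that also kept the best candidate would conflate the two effects.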
