LG · Apr 22, 2025

Learning Explainable Dense Reward Shapes via Bayesian Optimization

DeepMind
arXiv:2504.16272v1 · 2 citations · h-index: 13
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in RLHF pipelines for LLM alignment, offering an incremental improvement in token-level credit assignment.

The paper tackles sparse feedback and suboptimal token-level credit assignment in RLHF for LLM alignment. It proposes a reward-shaping function that uses explainability methods such as SHAP and LIME to estimate per-token rewards, and its experiments show performance improvements on downstream tasks and faster policy training.

Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign scalar rewards to sequences, using the final token as a surrogate indicator for the quality of the entire sequence. However, this leads to sparse feedback and suboptimal token-level credit assignment. In this work, we frame reward shaping as an optimization problem focused on token-level credit assignment. We propose a reward-shaping function leveraging explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. To learn parameters of this shaping function, we employ a bilevel optimization framework that integrates Bayesian Optimization and policy training to handle noise from the token reward estimates. Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines on downstream tasks and to finding an optimal policy faster during training. Furthermore, we show theoretically that explainability methods that are feature-additive attribution functions preserve the same optimal policy as the original reward.
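
To make the idea of token-level credit assignment concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical toy reward model (`toy_reward`), computes exact Shapley values by enumerating permutations instead of calling the SHAP or LIME libraries, and blends the dense attributions with the sparse final-token reward through a hypothetical weight `w`. In the abstract's framing, a weight of this kind is what an outer Bayesian Optimization loop would tune; that loop is not shown here.

```python
from itertools import permutations


def toy_reward(tokens):
    """Hypothetical sequence-level reward model (an assumption for illustration)."""
    score = 0.0
    if "thanks" in tokens:
        score += 1.0
    if "rude" in tokens:
        score -= 2.0
    # Interaction term: "not" together with "rude" turns the penalty into a bonus.
    if "not" in tokens and "rude" in tokens:
        score += 2.5
    return score


def shapley_values(tokens, reward_fn):
    """Exact Shapley value per token position; only feasible for short sequences."""
    n = len(tokens)
    values = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        present = set()
        prev = reward_fn(tuple())
        for idx in order:
            present.add(idx)
            subset = tuple(tokens[i] for i in sorted(present))
            cur = reward_fn(subset)
            values[idx] += cur - prev  # marginal contribution of token idx
            prev = cur
    return [v / len(orders) for v in values]


def shaped_rewards(tokens, reward_fn, w):
    """Blend dense attribution rewards with the sparse final-token reward.

    w in [0, 1] is a hypothetical shaping weight; w = 0 recovers the sparse signal.
    """
    total = reward_fn(tuple(tokens))
    dense = shapley_values(tokens, reward_fn)
    sparse = [0.0] * (len(tokens) - 1) + [total]
    return [w * d + (1.0 - w) * s for d, s in zip(dense, sparse)]


if __name__ == "__main__":
    tokens = ["thanks", "not", "rude"]
    print(shapley_values(tokens, toy_reward))
    print(shaped_rewards(tokens, toy_reward, w=0.5))
```

Because Shapley values are additive and the toy reward of the empty sequence is zero, the per-token rewards sum to the sequence-level reward for any `w`. The undiscounted return of the full sequence is therefore unchanged, and only its distribution across tokens differs.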

