LG AI, Jul 11, 2025

Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data

arXiv:2507.08761v2, 6 citations, h-index: 7, ICML
Originality: Incremental advance
AI Analysis

This paper addresses Q-value extrapolation error, a key bottleneck in offline RL that limits data efficiency and safety in real-world applications. The advance appears incremental because it builds on existing methods.

The paper tackles the problem of Q-value extrapolation errors in offline reinforcement learning by proposing a method that guides Q-value decreases outside the data range, resulting in superior performance on the D4RL benchmark, including a notable success in the challenging AntMaze Ultra task.

Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
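The two ingredients named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the network shape, the `reward_scale`, `q_min`, and `alpha` values, and the way infeasible actions are sampled (every coordinate pushed outside an assumed feasible range of [-1, 1]) are all hypothetical choices made for illustration. The sketch only evaluates the loss; a real critic would use a target network and an autodiff framework for the update.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def q_value(params, state, action):
    """Tiny one-hidden-layer critic with layer normalization before the ReLU."""
    w1, b1, w2, b2 = params
    h = np.concatenate([state, action], axis=-1) @ w1 + b1
    h = np.maximum(layer_norm(h), 0.0)
    return (h @ w2 + b2).squeeze(-1)

def sample_infeasible_actions(rng, batch, act_dim, limit=1.0, margin=1.0, width=1.0):
    """Draw actions whose every coordinate lies outside [-limit, limit].

    Assumption: the feasible range is [-limit, limit]; `margin` and `width`
    are hypothetical knobs controlling how far outside the samples fall.
    """
    magnitude = limit + margin + width * rng.random((batch, act_dim))
    sign = rng.choice([-1.0, 1.0], size=(batch, act_dim))
    return sign * magnitude

def critic_loss(params, batch, rng, reward_scale=10.0, gamma=0.99, q_min=0.0, alpha=1.0):
    """TD loss on scaled rewards plus a penalty pulling Q toward q_min on infeasible actions."""
    s, a, r, s2, a2, done = batch
    # Reward scaling: multiply the dataset reward before forming the TD target.
    # Simplification: bootstraps from the same parameters instead of a target network.
    target = reward_scale * r + gamma * (1.0 - done) * q_value(params, s2, a2)
    td = np.mean((q_value(params, s, a) - target) ** 2)
    # Penalization: regress Q toward an assumed low value q_min on infeasible actions.
    a_bad = sample_infeasible_actions(rng, len(s), a.shape[-1])
    pa = np.mean((q_value(params, s, a_bad) - q_min) ** 2)
    return td + alpha * pa, td, pa
```

Calling `critic_loss` on a batch of `(state, action, reward, next_state, next_action, done)` arrays returns the combined loss together with its TD and penalty terms, which makes it easy to see how `alpha` trades off fitting the data against pushing Q-values down outside the feasible action range.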

