LG · ML · Feb 9, 2020

Reward Tweaking: Maximizing the Total Reward While Planning for Short Horizons

arXiv:2002.03327v2 · 5 citations
AI Analysis

The paper addresses the sub-optimal behavior that arises in reinforcement learning when short planning horizons are used for stability. The contribution is incremental: it builds on existing work that adjusts the discount factor.

Deep reinforcement learning agents tend to become unstable when the planning horizon is long. The paper introduces reward tweaking, a method that learns a surrogate reward function for the discounted setting so that the agent behaves optimally on the original finite-horizon total-reward task. Experiments on high-dimensional continuous control tasks show improved long-horizon returns.

In reinforcement learning, the discount factor $γ$ controls the agent's effective planning horizon. Traditionally, this parameter was considered part of the MDP; however, as deep reinforcement learning algorithms tend to become unstable when the effective planning horizon is long, recent works refer to $γ$ as a hyper-parameter, thus changing the underlying MDP and potentially leading the agent towards sub-optimal behavior on the original task. In this work, we introduce reward tweaking. Reward tweaking learns a surrogate reward function $\tilde r$ for the discounted setting that induces optimal behavior on the original finite-horizon total reward task. Theoretically, we show that there exists a surrogate reward that leads to optimality in the original task and discuss the robustness of our approach. Additionally, we perform experiments in high-dimensional continuous control tasks and show that reward tweaking guides the agent towards better long-horizon returns although it plans for short horizons.
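
The mismatch the abstract describes can be seen in a toy example. The sketch below is a hand-built illustration and not the paper's method. It assumes two hypothetical reward sequences, `early` and `late`, and a hypothetical helper `rescaled_rewards` that divides each reward by $γ^t$. This is one simple, time-dependent way to build a surrogate by hand, whereas the paper learns $\tilde r$. The common rule of thumb $1/(1-γ)$ is used for the effective horizon.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


def effective_horizon(gamma):
    """Rule-of-thumb effective planning horizon, 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def rescaled_rewards(rewards, gamma):
    """Hypothetical hand-built surrogate: divide r_t by gamma**t.

    Under discounting, the surrogate's return then equals the
    undiscounted total of the original rewards. Illustration only;
    this is an assumption, not the learned surrogate from the paper.
    """
    return [r / gamma**t for t, r in enumerate(rewards)]


# Two hypothetical 10-step reward sequences.
early = [1.0] + [0.0] * 9   # small reward immediately
late = [0.0] * 9 + [2.0]    # larger reward at the last step
gamma = 0.9

# Total reward prefers `late`; the discounted objective prefers `early`.
print(sum(early), sum(late))
print(discounted_return(early, gamma), discounted_return(late, gamma))

# With the rescaled surrogate, the discounted objective again ranks
# `late` above `early`, matching the total-reward ordering.
print(discounted_return(rescaled_rewards(early, gamma), gamma),
      discounted_return(rescaled_rewards(late, gamma), gamma))
```

With $γ = 0.9$, the agent planning under the original rewards favors the immediate reward even though the delayed one yields a higher total. Changing only the reward signal, while keeping the same short horizon, restores the total-reward ordering in this toy case.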

