LG AIAug 28, 2023

On Reward Structures of Markov Decision Processes

arXiv:2308.14919v22.01 citationsh-index: 7

Originality Incremental advance

AI Analysis

This work addresses fundamental challenges in reinforcement learning, such as sample efficiency and safety, which are critical for applications like robotics, but it is largely incremental, building on existing concepts like reward shaping and MDP constants.

The paper tackles the problem of understanding and optimizing reward structures in Markov decision processes (MDPs) for reinforcement learning, resulting in a novel estimator with an instance-specific error bound of $ ilde{O}(\sqrt{rac{τ_s}{n}})$ for policy evaluation, theoretical insights into reward shaping, and algorithms for safe learning and multi-reward planning with promising numerical results.

A Markov decision process can be parameterized by a transition kernel and a reward function. Both play essential roles in the study of reinforcement learning as evidenced by their presence in the Bellman equations. In our inquiry of various kinds of "costs" associated with reinforcement learning inspired by the demands in robotic applications, rewards are central to understanding the structure of a Markov decision process and reward-centric notions can elucidate important concepts in reinforcement learning. Specifically, we study the sample complexity of policy evaluation and develop a novel estimator with an instance-specific error bound of $\tilde{O}(\sqrt{\frac{τ_s}{n}})$ for estimating a single state value. Under the online regret minimization setting, we refine the transition-based MDP constant, diameter, into a reward-based constant, maximum expected hitting cost, and with it, provide a theoretical explanation for how a well-known technique, potential-based reward shaping, could accelerate learning with expert knowledge. In an attempt to study safe reinforcement learning, we model hazardous environments with irrecoverability and proposed a quantitative notion of safe learning via reset efficiency. In this setting, we modify a classic algorithm to account for resets achieving promising preliminary numerical results. Lastly, for MDPs with multiple reward functions, we develop a planning algorithm that computationally efficiently finds Pareto-optimal stochastic policies.

View on arXiv PDF

Similar