Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures
This work addresses exploration in reinforcement learning with a method that improves state-action coverage and control performance, though it appears incremental because it builds on existing maximum entropy RL frameworks.
The paper tackles exploration in reinforcement learning by proposing an intrinsic reward based on the relative entropy of the discounted distribution of future states and actions. A policy that maximizes the expected discounted sum of these intrinsic rewards also maximizes a lower bound on the state-action value function. The resulting policies achieve good state-action space coverage and high-performance control.
Maximum entropy reinforcement learning integrates exploration into policy learning by providing additional intrinsic rewards proportional to the entropy of some distribution. In this paper, we propose a novel approach in which the intrinsic reward function is the relative entropy of the discounted distribution of states and actions (or features derived from these states and actions) visited during future time steps. This approach is motivated by two results. First, a policy maximizing the expected discounted sum of intrinsic rewards also maximizes a lower bound on the state-action value function of the decision process. Second, the distribution used in the intrinsic reward definition is the fixed point of a contraction operator. Existing algorithms can therefore be adapted to learn this fixed point off-policy and to compute the intrinsic rewards. Finally, we introduce an algorithm maximizing our new objective, and we show that the resulting policies have good state-action space coverage and achieve high-performance control.
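To make the fixed-point claim concrete, here is a minimal tabular sketch in Python. It is not the paper's algorithm. It assumes a small finite MDP with a known transition table `P` and a fixed policy `pi`, it tracks future states only rather than state-action pairs or features, and it assumes the recursion d(s' | s, a) = (1 - γ) P(s' | s, a) + γ E[d(s' | s'', a'')] as the definition of the visitation measure. The uniform target distribution and the reward, taken as the negative relative entropy to that target, are likewise illustrative assumptions, as are all function names.

```python
import math

def visitation_operator(d, P, pi, gamma):
    """Apply the assumed operator once to a table d[s][a][target]."""
    n_states, n_actions = len(P), len(P[0])
    new_d = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for target in range(n_states):
                # Discounted visitation of `target` from the successor pair onward.
                future = sum(
                    P[s][a][s2] * pi[s2][a2] * d[s2][a2][target]
                    for s2 in range(n_states)
                    for a2 in range(n_actions)
                )
                new_d[s][a][target] = (1 - gamma) * P[s][a][target] + gamma * future
    return new_d

def fixed_point(P, pi, gamma, tol=1e-10, max_iters=10_000):
    """Iterate the operator from a uniform table until the sup-norm change is below tol."""
    n_states, n_actions = len(P), len(P[0])
    d = [[[1.0 / n_states] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    for _ in range(max_iters):
        new_d = visitation_operator(d, P, pi, gamma)
        gap = max(
            abs(new_d[s][a][t] - d[s][a][t])
            for s in range(n_states)
            for a in range(n_actions)
            for t in range(n_states)
        )
        d = new_d
        if gap < tol:
            break
    return d

def intrinsic_reward(d_sa, target):
    """Negative KL divergence from the visitation distribution to a target (assumed form)."""
    return -sum(p * math.log(p / q) for p, q in zip(d_sa, target) if p > 0)

# Hypothetical 2-state, 2-action MDP: action 0 mostly stays, action 1 mostly switches.
P = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.1, 0.9], [0.9, 0.1]],
]
pi = [[0.5, 0.5], [0.5, 0.5]]  # uniform policy, pi[s][a]
gamma = 0.9
target = [0.5, 0.5]  # uniform target over future states (assumption)

d = fixed_point(P, pi, gamma)
rewards = [[intrinsic_reward(d[s][a], target) for a in range(2)] for s in range(2)]
print(rewards)
```

Because the operator shrinks sup-norm differences between two tables by a factor of γ, the iteration converges from any starting table. An off-policy method would replace the exact expectation over `P` with sampled transitions, which is the adaptation the abstract alludes to.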