LG AI ML | Jun 24, 2019

Ranking Policy Gradient

arXiv:1906.09674v38.110 citations | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses sample inefficiency in reinforcement learning, offering practitioners an incremental improvement through a new off-policy learning framework.

The paper tackles the sample inefficiency problem in reinforcement learning by proposing Ranking Policy Gradient (RPG), a method that learns optimal action ranks, and shows it reduces sample complexity compared to state-of-the-art methods in experiments.

Sample inefficiency is a long-standing problem in reinforcement learning (RL). State-of-the-art methods estimate the optimal action values, which usually involves an extensive search over the state-action space and unstable optimization. Towards sample-efficient RL, we propose ranking policy gradient (RPG), a policy gradient method that learns the optimal rank of a set of discrete actions. To accelerate the learning of policy gradient methods, we establish the equivalence between maximizing the lower bound of return and imitating a near-optimal policy without accessing any oracles. These results lead to a general off-policy learning framework, which preserves optimality, reduces variance, and improves sample efficiency. Furthermore, the sample complexity of RPG does not depend on the dimension of the state space, which makes RPG applicable to large-scale problems. We conduct extensive experiments showing that, when combined with the off-policy learning framework, RPG substantially reduces sample complexity compared to the state of the art.
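
To make "learning the rank of a set of discrete actions" concrete, here is a minimal sketch. It is not taken from this page: it assumes each action gets a real-valued score and that pairwise preferences follow a logistic function of the score gap. The function names (pairwise_prob, rank_based_action_weights, greedy_action) and the example scores are hypothetical.

```python
import math


def pairwise_prob(score_i, score_j):
    """Assumed model: probability that action i is ranked above action j,
    given as a logistic function of the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def rank_based_action_weights(scores):
    """For each action, multiply its pairwise win probabilities against
    every other action. The weights are unnormalized (they need not sum to 1)."""
    weights = []
    for i, s_i in enumerate(scores):
        w = 1.0
        for j, s_j in enumerate(scores):
            if j != i:
                w *= pairwise_prob(s_i, s_j)
        weights.append(w)
    return weights


def greedy_action(scores):
    """Pick the top-ranked action; only the ordering of scores matters."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical scores for three discrete actions in some state.
scores = [0.2, 1.5, -0.3]
weights = rank_based_action_weights(scores)
best = greedy_action(scores)
```

Because only the relative order of the scores determines the chosen action, a model of this kind does not need accurate action values, which is the contrast the abstract draws with value-estimation methods.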
