Learning Relative Return Policies With Upside-Down Reinforcement Learning
This work is an incremental advance for reinforcement learning researchers: it demonstrates the potential of upside-down reinforcement learning under more complex command structures.
The paper addresses whether upside-down reinforcement learning can learn policies that follow commands specifying a desired relationship between a scalar value and the observed return, and shows that it can do so online in a tabular bandit setting and in CartPole with non-linear function approximation.
Lately, there has been a resurgence of interest in using supervised learning to solve reinforcement learning problems. Recent work in this area has largely focused on learning command-conditioned policies. We investigate the potential of one such method -- upside-down reinforcement learning -- to work with commands that specify a desired relationship between some scalar value and the observed return. We show that upside-down reinforcement learning can learn to carry out such commands online in a tabular bandit setting and in CartPole with non-linear function approximation. By doing so, we demonstrate the power of this family of methods and open the way for their practical use under more complicated command structures.
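To make the idea of a relative-return command concrete, the following is a minimal sketch of a command-conditioned tabular policy trained online on a bandit. It is an illustration under stated assumptions, not the procedure used in the paper: the two-armed bandit, its arm means, the noise level, the two-element command set, and the count-based maximum-likelihood update are all hypothetical choices made for brevity. A command is a pair (relation, value), and after every pull the chosen action is relabelled in hindsight as a positive example for each command that the observed return happens to satisfy.

```python
import random

# Hypothetical two-armed bandit: mean return of each arm (assumption).
ARM_MEANS = [0.2, 0.8]
# Hypothetical command set: "obtain a return below / above 0.5" (assumption).
COMMANDS = [("<", 0.5), (">", 0.5)]


def satisfied(command, ret):
    """Return True if the observed return fulfils the command."""
    relation, value = command
    return ret < value if relation == "<" else ret > value


def policy(counts, command):
    """Action probabilities for a command: normalized hindsight counts."""
    row = counts[command]
    total = sum(row)
    return [c / total for c in row]


def train(steps=2000, seed=0):
    rng = random.Random(seed)
    # One count per (command, action); starting at 1 gives a uniform policy.
    counts = {c: [1.0] * len(ARM_MEANS) for c in COMMANDS}
    for _ in range(steps):
        # Act online: draw a command, then sample an action from the policy.
        command = rng.choice(COMMANDS)
        probs = policy(counts, command)
        action = rng.choices(range(len(ARM_MEANS)), weights=probs)[0]
        ret = ARM_MEANS[action] + rng.gauss(0.0, 0.05)
        # Hindsight relabelling: the action taken is a supervised target
        # for every command that the observed return satisfies.
        for c in COMMANDS:
            if satisfied(c, ret):
                counts[c][action] += 1.0
    return counts


counts = train()
for c in COMMANDS:
    print(c, [round(p, 3) for p in policy(counts, c)])
```

In this sketch the policy for the command ("<", 0.5) should concentrate on the low-return arm and the policy for (">", 0.5) on the high-return arm. The count table plays the role that a neural network trained with a cross-entropy loss would play under function approximation, as in a CartPole-style setting.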