Backward Learning for Goal-Conditioned Policies
This paper addresses reward-free policy learning for goal-conditioned tasks, though the contribution appears incremental because it combines existing imitation learning and world modeling techniques.
The paper tackles reward-free policy learning in reinforcement learning with a multi-step procedure: it learns a backward world model, generates goal-reaching trajectories, and trains a policy by imitation learning. The authors report consistent goal-reaching in a deterministic maze environment with $64\times 64$ pixel image observations.
Can we learn policies in reinforcement learning without rewards? Can we learn a policy just by trying to reach a goal state? We answer these questions positively by proposing a multi-step procedure that first learns a world model that goes backward in time, second generates goal-reaching backward trajectories, third improves those sequences using shortest-path-finding algorithms, and finally trains a neural network policy by imitation learning. We evaluate our method on a deterministic maze environment where the observations are $64\times 64$ pixel bird's-eye images and show that it consistently reaches several goals.
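
The four steps can be illustrated with a minimal tabular sketch. It is not the authors' implementation, and it rests on several assumptions. The maze layout, the function names, and the goal cell are hypothetical. The backward model is computed exactly from the known grid dynamics instead of being learned from pixel observations. The shortest-path step is a plain breadth-first search over the collected transitions. A lookup table stands in for the neural network policy trained by imitation.

```python
import random
from collections import deque

# Hypothetical 5x5 maze: '#' is a wall, '.' is free space.
MAZE = [
    ".....",
    ".###.",
    "...#.",
    ".#...",
    ".....",
]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def is_free(cell):
    r, c = cell
    return 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]) and MAZE[r][c] == "."


def step(state, action):
    """Deterministic forward dynamics: move if the target cell is free, else stay."""
    dr, dc = ACTIONS[action]
    nxt = (state[0] + dr, state[1] + dc)
    return nxt if is_free(nxt) else state


def backward_model(state):
    """Step 1 (assumed exact here, learned in the paper): list every
    (prev_state, action) pair such that step(prev_state, action) == state."""
    preds = []
    for name, (dr, dc) in ACTIONS.items():
        prev = (state[0] - dr, state[1] - dc)
        if is_free(prev):
            preds.append((prev, name))
    return preds


def sample_backward_trajectory(goal, length, rng):
    """Step 2: random walk backward in time, starting at the goal."""
    transitions, state = [], goal
    for _ in range(length):
        prev, action = rng.choice(backward_model(state))
        transitions.append((prev, action, state))
        state = prev
    return transitions


def improve_with_shortest_paths(transitions, goal):
    """Step 3: BFS from the goal over the collected transitions. Each visited
    state keeps one action that lies on a shortest collected path to the goal."""
    incoming = {}
    for prev, action, nxt in transitions:
        incoming.setdefault(nxt, set()).add((prev, action))
    dist, queue, demos = {goal: 0}, deque([goal]), []
    while queue:
        state = queue.popleft()
        for prev, action in sorted(incoming.get(state, ())):
            if prev not in dist:
                dist[prev] = dist[state] + 1
                queue.append(prev)
                demos.append((prev, action))
    return demos, dist


def rollout(policy, start, goal, max_steps=50):
    """Follow the policy with the forward dynamics until the goal or a dead end."""
    state, steps = start, 0
    while state != goal and state in policy and steps < max_steps:
        state = step(state, policy[state])
        steps += 1
    return state, steps


rng = random.Random(0)
goal = (4, 4)
transitions = []
for _ in range(200):
    transitions += sample_backward_trajectory(goal, 20, rng)
demos, dist = improve_with_shortest_paths(transitions, goal)

# Step 4: "imitation learning" reduced to a lookup table over the demonstrations.
policy = dict(demos)

# Roll out from the farthest state covered by the backward data.
start = max(dist, key=dist.get)
final_state, steps = rollout(policy, start, goal)
print(start, "->", final_state, "in", steps, "steps")
```

No reward signal appears anywhere in this sketch: the only supervision is the goal state from which the backward trajectories start. The policy is defined only on states that the backward walks visited, so coverage depends on how many trajectories are sampled and how long they are.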