Planning Transformer: Long-Horizon Offline Reinforcement Learning with Planning Tokens
This work addresses a key limitation of offline RL on long-horizon tasks, offering improved performance and interpretability, though the contribution is an incremental architectural modification.
The paper tackles the problem of long-horizon offline reinforcement learning by introducing Planning Tokens to reduce compounding error in auto-regressive models, achieving state-of-the-art performance on complex D4RL environments.
Supervised learning approaches to offline reinforcement learning, particularly those utilizing the Decision Transformer, have shown effectiveness in continuous environments and for sparse rewards. However, they often struggle with long-horizon tasks due to the high compounding error of auto-regressive models. To overcome this limitation, we go beyond next-token prediction and introduce Planning Tokens, which contain high-level, long time-scale information about the agent's future. Predicting dual time-scale tokens at regular intervals enables our model to use these long-horizon Planning Tokens as a form of implicit planning to guide its low-level policy and reduce compounding error. This architectural modification significantly enhances performance on long-horizon tasks, establishing a new state-of-the-art in complex D4RL environments. Additionally, we demonstrate that Planning Tokens improve the interpretability of the model's policy through interpretable plan visualisations and attention maps.
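The abstract does not say how Planning Tokens are built or where they sit in the input sequence, so the following is only an illustrative sketch of the dual time-scale idea, not the authors' implementation. It assumes, hypothetically, that a plan is a coarse subsample of future states from an offline trajectory and that a fresh plan is placed before the low-level state and action tokens at a fixed interval. The names `make_plan`, `build_sequence`, `stride`, `plan_len`, and `replan_every` are all invented for this example.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]
Token = Tuple[str, Vector]  # (kind, payload), kind in {"plan", "state", "action"}


def make_plan(states: Sequence[Vector], start: int, stride: int, plan_len: int) -> List[Vector]:
    """Assumed plan construction: every `stride`-th future state after `start`,
    truncated to at most `plan_len` entries."""
    future = [states[i] for i in range(start + stride, len(states), stride)]
    return future[:plan_len]


def build_sequence(
    states: Sequence[Vector],
    actions: Sequence[Vector],
    stride: int,
    plan_len: int,
    replan_every: int,
) -> List[Token]:
    """Interleave coarse plan tokens with fine-grained state/action tokens.

    A new plan is emitted every `replan_every` steps (an assumption standing in
    for the "regular intervals" mentioned in the abstract).
    """
    tokens: List[Token] = []
    for t, (s, a) in enumerate(zip(states, actions)):
        if t % replan_every == 0:
            # Long time-scale tokens: a sparse look-ahead over the trajectory.
            tokens.extend(("plan", p) for p in make_plan(states, t, stride, plan_len))
        # Short time-scale tokens: the usual per-step state and action.
        tokens.append(("state", s))
        tokens.append(("action", a))
    return tokens


if __name__ == "__main__":
    demo_states = [(float(i),) for i in range(8)]
    demo_actions = [(0.0,) for _ in range(8)]
    for kind, payload in build_sequence(demo_states, demo_actions, stride=2, plan_len=2, replan_every=4):
        print(kind, payload)
```

In a sequence model trained on such data, the low-level action predictions can attend to the preceding plan tokens, which is one way the long-horizon information could guide the policy. At inference time the plan would have to be predicted by the model rather than read from the dataset, a step this sketch does not cover.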