Recurrent Off-policy Baselines for Memory-based Continuous Control
This work provides a model-free baseline for history-based RL, filling a gap for reinforcement learning researchers, but the contribution is incremental because it adapts existing methods to the partially observable setting.
The authors tackle partially observable continuous control in deep reinforcement learning by implementing recurrent versions of DDPG, TD3, and SAC (RDPG, RTD3, and RSAC) and evaluating them on short-term and long-term domains. They find that RSAC is the most reliable, reaching near-optimal performance on nearly all domains, although one task that requires systematic exploration remains difficult.
When the environment is partially observable (PO), a deep reinforcement learning (RL) agent must learn a suitable temporal representation of the entire history in addition to a control strategy. The problem is not new, and both model-free and model-based algorithms have been proposed for it. However, inspired by recent success in model-free image-based RL, we noticed the absence of a model-free baseline for history-based RL that (1) uses full history and (2) incorporates recent advances in off-policy continuous control. Therefore, in this work we implement recurrent versions of DDPG, TD3, and SAC (RDPG, RTD3, and RSAC), evaluate them on short-term and long-term PO domains, and investigate key design choices. Our experiments show that RDPG and RTD3 can surprisingly fail on some domains and that RSAC is the most reliable, reaching near-optimal performance on nearly all domains. However, one task that requires systematic exploration still proved difficult, even for RSAC. These results show that model-free RL can learn good temporal representations using only reward signals; the primary difficulties seem to be computational cost and exploration. To facilitate future research, we have made our PyTorch implementation publicly available at https://github.com/zhihanyang2022/off-policy-continuous-control.
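To make "uses full history" concrete, the following is a minimal sketch of a history-based actor, written in plain Python so it runs without dependencies. It is not the authors' PyTorch code: the class names `RecurrentSummarizer` and `RecurrentActor`, the Elman-style tanh cell, the fixed random weights, and the toy dimensions are all assumptions made for illustration, and no training is performed. The sketch only shows the structural idea that a recurrent summary of all past observations, rather than the latest observation alone, is what the policy head conditions on.

```python
import math
import random


class RecurrentSummarizer:
    """Hypothetical Elman-style cell: h_t = tanh(W_x x_t + W_h h_{t-1})."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)  # fixed seed so the sketch is deterministic
        self.hidden_dim = hidden_dim
        self.w_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)]
                    for _ in range(hidden_dim)]
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
                    for _ in range(hidden_dim)]

    def step(self, x, h):
        # One recurrent update from the current input and the previous summary.
        return [
            math.tanh(
                sum(w * xi for w, xi in zip(self.w_x[i], x))
                + sum(w * hi for w, hi in zip(self.w_h[i], h))
            )
            for i in range(self.hidden_dim)
        ]

    def summarize(self, history):
        # Fold the whole observation history into one hidden vector.
        h = [0.0] * self.hidden_dim
        for x in history:
            h = self.step(x, h)
        return h


class RecurrentActor:
    """Hypothetical deterministic actor: action = tanh(W_out * summary)."""

    def __init__(self, obs_dim, hidden_dim, action_dim, seed=0):
        self.summarizer = RecurrentSummarizer(obs_dim, hidden_dim, seed)
        rng = random.Random(seed + 1)
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
                      for _ in range(action_dim)]

    def act(self, obs_history):
        h = self.summarizer.summarize(obs_history)
        # tanh keeps each action component inside [-1, 1].
        return [math.tanh(sum(w * hi for w, hi in zip(row, h)))
                for row in self.w_out]


# Two histories that end in the same observation but differ earlier on.
actor = RecurrentActor(obs_dim=2, hidden_dim=4, action_dim=1)
action_a = actor.act([[1.0, 0.0], [0.0, 0.0]])
action_b = actor.act([[-1.0, 0.0], [0.0, 0.0]])
print(action_a, action_b)
```

The two histories share their final observation, so a memoryless policy would have to return the same action for both; the recurrent actor can tell them apart because the hidden state carries the earlier observation forward. In a trained agent the weights would be learned from reward, which this sketch does not attempt.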