Consistent Dropout for Policy Gradient Reinforcement Learning
The work addresses a specific technical bottleneck for RL practitioners: it enables stable use of dropout in policy-gradient methods. The contribution is incremental, since it adapts an existing technique.
The paper tackles the instability of naive dropout in policy-gradient reinforcement learning by introducing consistent dropout, which enables stable training with A2C and PPO in both continuous and discrete action environments and allows online training of architectures such as GPT without disabling their native dropout.
Dropout has long been a staple of supervised learning, but it is rarely used in reinforcement learning. We analyze why naive application of dropout is problematic for policy-gradient learning algorithms and introduce consistent dropout, a simple technique to address this instability. We demonstrate that consistent dropout enables stable training with A2C and PPO in both continuous and discrete action environments across a wide range of dropout probabilities. Finally, we show that consistent dropout enables the online training of complex architectures such as GPT without needing to disable the model's native dropout.
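The abstract does not spell out the mechanism, so the following is only a minimal sketch. It assumes that consistent dropout means storing the dropout mask sampled while acting and reusing that same mask when the update recomputes log-probabilities for the stored transition. The tiny linear policy and the helpers `sample_mask` and `log_probs` are hypothetical illustrations, not the authors' implementation.

```python
import math
import random


def sample_mask(n, p, rng):
    """Sample an inverted-dropout mask: keep each unit with probability 1 - p."""
    keep = 1.0 - p
    return [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(n)]


def log_probs(weights, features, mask):
    """Log-softmax over actions for a linear policy applied to masked features."""
    dropped = [f * m for f, m in zip(features, mask)]
    logits = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
    top = max(logits)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return [l - log_z for l in logits]


rng = random.Random(0)
weights = [[0.5, -1.0, 0.3, 0.8], [-0.2, 0.7, 1.1, -0.6]]  # 2 actions, 4 features
features = [1.0, 2.0, -1.0, 0.5]
p = 0.5       # dropout probability (illustrative value)
action = 0    # action taken during the rollout (illustrative value)

# Rollout: sample a mask, act, and store the mask alongside the transition.
rollout_mask = sample_mask(len(features), p, rng)
rollout_logp = log_probs(weights, features, rollout_mask)

# Naive update: a freshly sampled mask can change the log-probability of the
# same state-action pair even though the weights have not changed.
fresh_mask = sample_mask(len(features), p, rng)
naive_logp = log_probs(weights, features, fresh_mask)
naive_ratio = math.exp(naive_logp[action] - rollout_logp[action])

# Consistent update (assumed mechanism): reuse the stored rollout mask.
consistent_logp = log_probs(weights, features, rollout_mask)
consistent_ratio = math.exp(consistent_logp[action] - rollout_logp[action])

print("naive ratio:", naive_ratio)
print("consistent ratio:", consistent_ratio)
```

With the stored mask, the ratio between the recomputed and the rollout probability is exactly 1 before any weight change, which is the starting point that importance-ratio objectives such as PPO's rely on. With a fresh mask the ratio generally drifts away from 1 for reasons unrelated to the policy update.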