DNA: Proximal Policy Optimization with a Dual Network Architecture
This work addresses a key bottleneck in deep reinforcement learning, with the aim of improving stability and performance in stochastic control settings.
The paper tackles the sub-optimal joint learning of value functions and policies in actor-critic reinforcement learning. It proposes a dual network architecture that learns the two independently, linked by a constrained distillation phase. The resulting method significantly outperforms Proximal Policy Optimization and exceeds the performance of Rainbow DQN on four of the five environments tested.
This paper explores the problem of simultaneously learning a value function and policy in deep actor-critic reinforcement learning models. We find that the common practice of learning these functions jointly is sub-optimal, due to an order-of-magnitude difference in noise levels between these two tasks. Instead, we show that learning these tasks independently, but with a constrained distillation phase, significantly improves performance. Furthermore, we find that the policy gradient noise level can be decreased by using a lower \textit{variance} return estimate, whereas the value learning noise level decreases with a lower \textit{bias} estimate. Together, these insights inform an extension to Proximal Policy Optimization that we call \textit{Dual Network Architecture} (DNA), which significantly outperforms its predecessor. DNA also exceeds the performance of the popular Rainbow DQN algorithm on four of the five environments tested, even under more difficult stochastic control settings.
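To make the bias/variance distinction between return estimates concrete, the following minimal sketch computes $\lambda$-returns with two different settings. It assumes a generalized-advantage-style (TD($\lambda$)) estimator, which is common in PPO implementations; the abstract does not name the estimator, and the function names (`gae_advantages`, `lambda_returns`) and the toy data are hypothetical. A small `lam` leans on the learned value estimates (lower variance, more bias), while `lam` near 1 approaches the Monte Carlo return (lower bias, more variance).

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    `values` must have len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state reached after the final reward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Lambda-return targets: advantage plus the baseline value."""
    adv = gae_advantages(rewards, values, gamma, lam)
    return [a + v for a, v in zip(adv, values[:-1])]


# Toy trajectory (illustrative only): three unit rewards, zero value estimates.
rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]

# Assumption: a lower lambda for a lower-variance estimate,
# and a higher lambda for a lower-bias estimate.
low_variance = lambda_returns(rewards, values, gamma=1.0, lam=0.0)
low_bias = lambda_returns(rewards, values, gamma=1.0, lam=1.0)
```

With `lam=0.0` each target is the one-step TD target, and with `lam=1.0` it is the full discounted return plus the bootstrap value. Exposing `lam` as a per-call argument is one way to let separately trained policy and value networks use different estimates.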