LG, AI | Jun 7, 2020

Dual Policy Distillation

Kwei-Herng Lai, Daochen Zha, Yuening Li, Xia Hu

arXiv:2006.04061v1 | 16.5 | 52 citations | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the computational cost and learning efficiency of reinforcement learning, a concern for both researchers and practitioners. The advance is incremental, since it builds on existing distillation methods.

The paper tackles the computational expense and performance limitations of teacher-student policy distillation in deep reinforcement learning by introducing a dual policy distillation framework where two student models learn collaboratively from each other, achieving superior performance on continuous control tasks without needing a teacher model.

Policy distillation, which transfers a teacher policy to a student policy, has achieved great success in challenging tasks of deep reinforcement learning. This teacher-student framework requires a well-trained teacher model, which is computationally expensive to obtain. Moreover, the performance of the student model could be limited by the teacher model if the teacher model is not optimal. In light of collaborative learning, we study the feasibility of involving joint intellectual efforts from diverse perspectives of student models. In this work, we introduce dual policy distillation (DPD), a student-student framework in which two learners operate on the same environment to explore different perspectives of the environment and extract knowledge from each other to enhance their learning. The key challenge in developing this dual learning framework is to identify the beneficial knowledge from the peer learner for contemporary learning-based reinforcement learning algorithms, since it is unclear whether the knowledge distilled from an imperfect and noisy peer learner would be helpful. To address the challenge, we theoretically justify that distilling knowledge from a peer learner will lead to policy improvement and propose a disadvantageous distillation strategy based on the theoretical results. Experiments on several continuous control tasks show that the proposed framework achieves superior performance with a learning-based agent and function approximation without the use of expensive teacher models.
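The abstract does not spell out the distillation objective, so the following is only a rough sketch of how a "disadvantageous" selection rule might look, under stated assumptions. It assumes each learner has a value estimate per state and that a learner imitates its peer only on states where the peer's estimate is higher than its own. The function name `distillation_loss`, the squared-distance measure, and the toy numbers are all hypothetical and are not taken from the paper.

```python
def distillation_loss(own_actions, peer_actions, own_values, peer_values):
    """Mean squared action gap over states where the peer looks better.

    Assumption (not from the abstract): "better" means the peer's value
    estimate for the state is strictly higher than our own.
    """
    gaps = []
    for a_own, a_peer, v_own, v_peer in zip(
        own_actions, peer_actions, own_values, peer_values
    ):
        if v_peer > v_own:  # we are at a disadvantage in this state
            gaps.append(sum((x - y) ** 2 for x, y in zip(a_own, a_peer)))
    # If the peer is never better, there is nothing to distill.
    return sum(gaps) / len(gaps) if gaps else 0.0


# Toy batch: three states, one-dimensional actions (illustrative values only).
actions_a = [[0.0], [1.0], [2.0]]
actions_b = [[1.0], [1.0], [0.0]]
values_a = [0.5, 2.0, 1.0]
values_b = [1.5, 1.0, 3.0]

# Each learner computes its own loss against the other, so the exchange is symmetric.
loss_a = distillation_loss(actions_a, actions_b, values_a, values_b)
loss_b = distillation_loss(actions_b, actions_a, values_b, values_a)
```

In this toy batch, learner A is pulled toward B on the first and third states, while B is pulled toward A only on the second. In a full training loop, a term of this kind would presumably be added to each learner's usual reinforcement learning objective.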
