LG · Jan 30

Continual Policy Distillation from Distributed Reinforcement Learning Teachers

Yuxuan Li, Qijun He, Mingqi Yuan, Wen-Tse Chen, Jeff Schneider, Jiayu Chen

arXiv:2601.22475v1 · 2.71 citations · h-index: 8

Originality: Incremental advance

AI Analysis

The paper addresses scalable lifelong learning for AI agents, though the advance is incremental: it builds on existing policy distillation and mixture-of-experts techniques.

The paper tackles the challenge of continual reinforcement learning by proposing a teacher-student framework that decouples training into distributed RL for single-task teachers and continual distillation into a central model, achieving over 85% of teacher performance with task-wise forgetting constrained to within 10% on the Meta-World benchmark.

Continual Reinforcement Learning (CRL) aims to develop lifelong learning agents to continuously acquire knowledge across diverse tasks while mitigating catastrophic forgetting. This requires efficiently managing the stability-plasticity dilemma and leveraging prior experience to rapidly generalize to novel tasks. While various enhancement strategies for both aspects have been proposed, achieving scalable performance by directly applying RL to sequential task streams remains challenging. In this paper, we propose a novel teacher-student framework that decouples CRL into two independent processes: training single-task teacher models through distributed RL and continually distilling them into a central generalist model. This design is motivated by the observation that RL excels at solving single tasks, while policy distillation -- a relatively stable supervised learning process -- is well aligned with large foundation models and multi-task learning. Moreover, a mixture-of-experts (MoE) architecture and a replay-based approach are employed to enhance the plasticity and stability of the continual policy distillation process. Extensive experiments on the Meta-World benchmark demonstrate that our framework enables efficient continual RL, recovering over 85% of teacher performance while constraining task-wise forgetting to within 10%.
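The two headline numbers (share of teacher performance recovered, and task-wise forgetting) can be made concrete with a small calculation. The sketch below is illustrative only: the abstract does not give the exact metric definitions, so it assumes recovery is the student's final score divided by the teacher's score, and forgetting is the drop from the student's best recorded score on a task to its final score. The function names, task names, and all scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def recovery_ratio(student: float, teacher: float) -> float:
    """Fraction of the teacher's score that the student reaches on one task."""
    if teacher <= 0:
        raise ValueError("teacher score must be positive")
    return student / teacher


def forgetting(history: List[float]) -> float:
    """Drop from the best score ever recorded on a task to its final score.

    history[i] is the student's score on the task after distillation stage i,
    starting at the stage in which the task was first distilled.
    (Assumed definition; the paper may measure forgetting differently.)
    """
    if not history:
        raise ValueError("history must not be empty")
    return max(history) - history[-1]


# Hypothetical success rates, made up for illustration only.
teacher_scores: Dict[str, float] = {"reach": 1.00, "push": 0.90}
student_history: Dict[str, List[float]] = {
    "reach": [0.95, 0.92, 0.90],  # distilled first, then re-evaluated twice
    "push": [0.85, 0.80],         # distilled second, then re-evaluated once
}

# Per-task report: recovery uses the final student score on each task.
report = {
    task: {
        "recovery": recovery_ratio(scores[-1], teacher_scores[task]),
        "forgetting": forgetting(scores),
    }
    for task, scores in student_history.items()
}

for task, metrics in report.items():
    print(
        f"{task}: recovery={metrics['recovery']:.2f}, "
        f"forgetting={metrics['forgetting']:.2f}"
    )
```

Under these assumed definitions, a result like the one reported would correspond to every task having a recovery above 0.85 and a forgetting value of at most 0.10.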
