LG AI May 22, 2025

How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning

Max Weltevrede, Moritz A. Zanger, Matthijs T. J. Spaan, Wendelin Böhmer

arXiv:2505.16581v24.1 h-index: 33

Originality: Incremental advance

AI Analysis

This work addresses how reinforcement learning agents generalize to unseen environments, offering an incremental advance through a theoretical bound and empirical validation.

The paper tackles generalization in zero-shot policy transfer in reinforcement learning by proving a generalization bound for policy distillation. The bound yields two practical insights: train an ensemble of distilled policies, and distil it on as much data from the training environments as possible. Experiments indicate that these insights hold in more general settings where the theory's assumptions no longer apply, and that such an ensemble, distilled on a diverse dataset, can generalize significantly better than the original agent.

In the zero-shot policy transfer setting in reinforcement learning, the goal is to train an agent on a fixed set of training environments so that it can generalise to similar, but unseen, testing environments. Previous work has shown that policy distillation after training can sometimes produce a policy that outperforms the original in the testing environments. However, it is not yet entirely clear why that is, or what data should be used to distil the policy. In this paper, we prove, under certain assumptions, a generalisation bound for policy distillation after training. The theory provides two practical insights: for improved generalisation, you should 1) train an ensemble of distilled policies, and 2) distil it on as much data from the training environments as possible. We empirically verify that these insights hold in more general settings, when the assumptions required for the theory no longer hold. Finally, we demonstrate that an ensemble of policies distilled on a diverse dataset can generalise significantly better than the original agent.
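The two insights from the abstract can be illustrated with a toy sketch. The abstract does not specify the authors' architecture, loss, or data collection, so everything below is an assumption made for illustration: `teacher_policy` is a hypothetical stand-in for the trained agent, the students are linear-softmax models fitted by cross-entropy to the teacher's action probabilities, and `train_states` is a random sample standing in for a broad set of states from the training environments.

```python
import math
import random

NUM_ACTIONS = 3
NUM_FEATURES = 3  # two state features plus a bias term


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_policy(state):
    """Hypothetical stand-in for the trained agent's action distribution."""
    x, y = state
    return softmax([2.0 * x, 2.0 * y, -2.0 * (x + y)])


def student_probs(weights, state):
    """Linear-softmax student: one weight row per action."""
    feats = list(state) + [1.0]
    return softmax([sum(w * f for w, f in zip(row, feats)) for row in weights])


def distil_student(states, seed, epochs=100, lr=0.5):
    """Fit one student to the teacher by gradient descent on cross-entropy."""
    rng = random.Random(seed)  # a different seed gives a different initialisation
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(NUM_FEATURES)]
               for _ in range(NUM_ACTIONS)]
    targets = [teacher_policy(s) for s in states]
    for _ in range(epochs):
        grads = [[0.0] * NUM_FEATURES for _ in range(NUM_ACTIONS)]
        for s, t in zip(states, targets):
            p = student_probs(weights, s)
            feats = list(s) + [1.0]
            for a in range(NUM_ACTIONS):
                for j in range(NUM_FEATURES):
                    # Gradient of cross-entropy w.r.t. the logits is (p - t).
                    grads[a][j] += (p[a] - t[a]) * feats[j]
        for a in range(NUM_ACTIONS):
            for j in range(NUM_FEATURES):
                weights[a][j] -= lr * grads[a][j] / len(states)
    return weights


def ensemble_probs(ensemble, state):
    """Average the action distributions of all students in the ensemble."""
    per_student = [student_probs(w, state) for w in ensemble]
    return [sum(p[a] for p in per_student) / len(per_student)
            for a in range(NUM_ACTIONS)]


# Insight 2 (assumed form): distil on a broad sample of training states.
data_rng = random.Random(0)
train_states = [(data_rng.uniform(-1, 1), data_rng.uniform(-1, 1))
                for _ in range(100)]

# Insight 1: train several independently initialised students.
ensemble = [distil_student(train_states, seed=k) for k in range(5)]

test_state = (0.8, -0.5)
ens = ensemble_probs(ensemble, test_state)
ens_action = max(range(NUM_ACTIONS), key=lambda a: ens[a])
```

Averaging probabilities is only one way to combine the students. The sketch makes no claim about how the paper aggregates its ensemble or measures generalisation.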
