LG · Nov 4, 2021

Towards an Understanding of Default Policies in Multitask Policy Optimization

arXiv:2111.02994v4 · 12 citations
Originality: Incremental advance
AI Analysis

This work provides foundational insights for developing more generally capable agents in multitask settings. It is nonetheless an incremental advance, since it builds on existing regularized policy optimization methods.

The paper addresses the lack of formal understanding of default policies in multitask reinforcement learning, linking default policy quality to optimization effects and deriving a principled algorithm with strong performance guarantees.

Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms with strong performance across multiple domains. In this family of methods, agents are trained to maximize cumulative reward while penalizing deviation in behavior from some reference, or default policy. In addition to empirical success, there is a strong theoretical foundation for understanding RPO methods applied to single tasks, with connections to natural gradient, trust region, and variational approaches. However, there is limited formal understanding of desirable properties for default policies in the multitask setting, an increasingly important domain as the field shifts towards training more generally capable agents. Here, we take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization. Using these results, we then derive a principled RPO algorithm for multitask learning with strong performance guarantees.
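To make the regularized objective concrete, the following is a minimal sketch for a single-state (bandit) problem. It assumes the deviation penalty is a KL divergence weighted by a temperature `alpha`, which is one common choice in RPO methods and not necessarily the exact form used in the paper. The function names, reward values, and uniform default policy are illustrative assumptions. Under these assumptions the maximizer has the well-known closed form, proportional to the default policy times `exp(reward / alpha)`.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def regularized_objective(policy, default_policy, rewards, alpha):
    """Expected reward minus alpha * KL(policy || default_policy).

    Assumption: a KL penalty; the paper's abstract only says that
    deviation from the default policy is penalized.
    """
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - alpha * kl_divergence(policy, default_policy)


def optimal_regularized_policy(default_policy, rewards, alpha):
    """Closed-form maximizer for the single-state case:
    pi*(a) is proportional to default_policy(a) * exp(rewards(a) / alpha).
    """
    shift = max(rewards)  # subtract the max for numerical stability
    weights = [
        p0 * math.exp((r - shift) / alpha)
        for p0, r in zip(default_policy, rewards)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical three-action task with a uniform default policy.
rewards = [1.0, 0.0, 0.5]
default_policy = [1 / 3, 1 / 3, 1 / 3]
alpha = 0.5

pi_star = optimal_regularized_policy(default_policy, rewards, alpha)
greedy = [1.0, 0.0, 0.0]

print("regularized optimum:", pi_star)
print("objective at optimum:", regularized_objective(pi_star, default_policy, rewards, alpha))
print("objective at greedy:", regularized_objective(greedy, default_policy, rewards, alpha))
```

In this sketch a larger `alpha` keeps the solution closer to the default policy, while a smaller `alpha` pushes it toward the greedy action. This is why the choice of default policy can matter for optimization.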

