LG AI ML · Jan 11, 2023

Adversarial Online Multi-Task Reinforcement Learning

arXiv:2301.04268v12.01 citations · h-index: 12 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficient learning across multiple tasks in adversarial environments. The advance is incremental: it builds on prior task-separability notions and adds new theoretical guarantees.

The paper tackles adversarial online multi-task reinforcement learning, where in each episode a learner faces an unknown task drawn from a finite set of MDPs and aims to minimize regret relative to the optimal policy for each task. The authors prove a minimax lower bound on regret and an instance-specific lower bound on sample complexity, and provide a polynomial-time algorithm whose dependence on $K$ and $\frac{1}{\lambda^2}$ is tight.

We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set $\mathcal{M}$ of $M$ unknown finite-horizon MDP models. The learner's objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $\mathcal{M}$ are well-separated under a notion of $\lambda$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $\Omega(K\sqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $\Omega(\frac{K}{\lambda^2})$ in sample complexity for a class of uniformly-good cluster-then-learn algorithms. We use a novel construction called 2-JAO MDP for proving the instance-specific lower bound. The lower bounds are complemented with a polynomial-time algorithm that obtains a $\tilde{O}(\frac{K}{\lambda^2})$ sample complexity guarantee for the clustering phase and a $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependency on $K$ and $\frac{1}{\lambda^2}$ is tight.
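
As a rough sketch of the objective described above, the regret could be written as follows. The notation here is an assumption made for illustration and is not taken from the paper: $m_k \in \mathcal{M}$ denotes the task presented in episode $k$, $\pi_k$ the policy the learner plays in that episode, and $V^{\pi}_{m}$ the expected total reward of policy $\pi$ in MDP $m$ over the horizon.

$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \left( V^{\pi^{*}_{m_k}}_{m_k} - V^{\pi_k}_{m_k} \right), \qquad \pi^{*}_{m} \in \arg\max_{\pi} V^{\pi}_{m}.$$

Under the same reading, the tightness claim compares the two sample complexity bounds for the clustering phase, which match up to logarithmic factors (the lower bound applies to the class of uniformly-good cluster-then-learn algorithms):

$$\underbrace{\Omega\!\left(\frac{K}{\lambda^2}\right)}_{\text{lower bound}} \quad \text{versus} \quad \underbrace{\tilde{O}\!\left(\frac{K}{\lambda^2}\right)}_{\text{upper bound}}.$$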

