LG AI

Jul 11, 2023

Scaling Distributed Multi-task Reinforcement Learning with Experience Sharing

Sanae Amani, Khushbu Pahwa, Vladimir Braverman, Lin F. Yang

arXiv:2307.05834v1

3.81 citations

h-index: 9

Originality: Incremental advance

AI Analysis

This work addresses scalable lifelong learning for distributed agents, offering incremental improvements in sample efficiency for multi-task RL.

The paper tackles distributed multi-task reinforcement learning, where N agents collaboratively solve M tasks without prior knowledge of the task identities. It proposes DistMT-LSVI, which achieves ε-optimal policies with a per-agent sample complexity that is a factor of 1/N smaller than in the non-distributed setting. Experiments on Atari environments support the theoretical results.

Recently, DARPA launched the ShELL program, which aims to explore how experience sharing can benefit distributed lifelong learning agents in adapting to new challenges. In this paper, we address this issue by conducting both theoretical and empirical research on distributed multi-task reinforcement learning (RL), where a group of $N$ agents collaboratively solves $M$ tasks without prior knowledge of their identities. We formulate the problem as linearly parameterized contextual Markov decision processes (MDPs), where each task is represented by a context that specifies the transition dynamics and rewards. To tackle this problem, we propose an algorithm called DistMT-LSVI. First, the agents identify the tasks, and then they exchange information through a central server to derive $ε$-optimal policies for the tasks. We show that to achieve $ε$-optimal policies for all $M$ tasks, a single agent using DistMT-LSVI needs to run a total number of episodes that is at most $\tilde{\mathcal{O}}({d^3H^6(ε^{-2}+c_{\rm sep}^{-2})}\cdot M/N)$, where $c_{\rm sep}>0$ is a constant representing task separability, $H$ is the horizon of each episode, and $d$ is the feature dimension of the dynamics and rewards. Notably, DistMT-LSVI improves on the sample complexity of the non-distributed setting by a factor of $1/N$: in that setting, each agent independently learns $ε$-optimal policies for all $M$ tasks using $\tilde{\mathcal{O}}(d^3H^6Mε^{-2})$ episodes. Additionally, we provide numerical experiments conducted on OpenAI Gym Atari environments that validate our theoretical findings.
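To make the comparison between the two bounds concrete, here is a minimal Python sketch that evaluates only their leading-order terms. It assumes all constants and logarithmic factors hidden by the $\tilde{\mathcal{O}}$ notation are dropped, so the outputs are relative scalings rather than real episode counts. The function names and the parameter values are hypothetical and chosen purely for illustration; they do not come from the paper.

```python
def distributed_episodes_per_agent(d, H, eps, c_sep, M, N):
    """Leading-order term d^3 H^6 (eps^-2 + c_sep^-2) * M / N.

    Assumption: constants and log factors hidden by the O-tilde are ignored.
    """
    return d**3 * H**6 * (eps**-2 + c_sep**-2) * M / N


def independent_episodes_per_agent(d, H, eps, M):
    """Leading-order term d^3 H^6 M eps^-2 for an agent learning alone.

    Assumption: constants and log factors hidden by the O-tilde are ignored.
    """
    return d**3 * H**6 * M * eps**-2


def bound_ratio(d, H, eps, c_sep, M, N):
    """Ratio of the distributed bound to the independent bound.

    Algebraically this equals (1 + (eps / c_sep)^2) / N.
    """
    shared = distributed_episodes_per_agent(d, H, eps, c_sep, M, N)
    alone = independent_episodes_per_agent(d, H, eps, M)
    return shared / alone


if __name__ == "__main__":
    # Illustrative values only; not taken from the paper's experiments.
    ratio = bound_ratio(d=10, H=5, eps=0.1, c_sep=0.5, M=20, N=10)
    print(f"distributed / independent = {ratio:.3f}")
```

The ratio simplifies to $(1 + (ε/c_{\rm sep})^2)/N$, so in this sketch the $1/N$ improvement is close to exact when $ε$ is small relative to $c_{\rm sep}$, and the separability term becomes more visible as $ε$ approaches $c_{\rm sep}$.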
