Communication Efficient Parallel Reinforcement Learning
This work addresses communication efficiency in distributed reinforcement learning, which is incremental as it builds on existing methods like UCRL2 to reduce synchronization costs.
The paper tackles the problem of minimizing regret in parallel reinforcement learning with multiple agents interacting independently, while reducing communication rounds with a central server. The proposed algorithm achieves a total cumulative regret bound of $ ilde{O}(DS\sqrt{MAT})$ and reduces communication rounds to $O(MSA\log(MT))$, performing comparably to an always-communicating baseline with significantly lower communication.
We consider the problem where $M$ agents interact with $M$ identical and independent environments with $S$ states and $A$ actions using reinforcement learning for $T$ rounds. The agents share their data with a central server to minimize their regret. We aim to find an algorithm that allows the agents to minimize the regret with infrequent communication rounds. We provide \NAM\ which runs at each agent and prove that the total cumulative regret of $M$ agents is upper bounded as $\Tilde{O}(DS\sqrt{MAT})$ for a Markov Decision Process with diameter $D$, number of states $S$, and number of actions $A$. The agents synchronize after their visitations to any state-action pair exceeds a certain threshold. Using this, we obtain a bound of $O\left(MSA\log(MT)\right)$ on the total number of communications rounds. Finally, we evaluate the algorithm against multiple environments and demonstrate that the proposed algorithm performs at par with an always communication version of the UCRL2 algorithm, while with significantly lower communication.