Near-Optimal Reinforcement Learning with Self-Play under Adaptivity Constraints
This work addresses the problem of minimizing policy updates in real-world multi-agent reinforcement learning applications.
The authors tackled the problem of multi-agent reinforcement learning with adaptivity constraints, achieving a regret of $widetilde{O}(sqrt{H^3 S^2 ABK})$ and a batch complexity of $O(H+loglog K)$. They also proved a batch complexity lower bound of $Ω(rac{H}{log_{A}K}+loglog K)$ for all algorithms with $widetilde{O}(sqrt{K})$ regret bound.
We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints -- a new problem motivated by real-world applications where deployments of new policies are costly and the number of policy updates must be minimized. For two-player zero-sum Markov Games, we design a (policy) elimination based algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^3 S^2 ABK})$, while the batch complexity is only $O(H+\log\log K)$. In the above, $S$ denotes the number of states, $A,B$ are the number of actions for the two players respectively, $H$ is the horizon and $K$ is the number of episodes. Furthermore, we prove a batch complexity lower bound $Ω(\frac{H}{\log_{A}K}+\log\log K)$ for all algorithms with $\widetilde{O}(\sqrt{K})$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near optimal batch complexity. To the best of our knowledge, these are the first line of results towards understanding MARL with low adaptivity.