Provably Efficient Offline Multi-agent Reinforcement Learning via Strategy-wise Bonus
It addresses the sample efficiency problem in offline multi-agent RL for researchers and practitioners, offering a novel approach that is not incremental but provides foundational improvements.
This paper tackles offline multi-agent reinforcement learning by proposing a strategy-wise concentration principle to avoid the curse of multiagents, resulting in algorithms with sample complexities scaling linearly with the sum of action sizes instead of exponentially with the joint action space, e.g., scaling with ∑_{i=1}^m A_i rather than Π_{i=1}^m A_i.
This paper considers offline multi-agent reinforcement learning. We propose the strategy-wise concentration principle which directly builds a confidence interval for the joint strategy, in contrast to the point-wise concentration principle that builds a confidence interval for each point in the joint action space. For two-player zero-sum Markov games, by exploiting the convexity of the strategy-wise bonus, we propose a computationally efficient algorithm whose sample complexity enjoys a better dependency on the number of actions than the prior methods based on the point-wise bonus. Furthermore, for offline multi-agent general-sum Markov games, based on the strategy-wise bonus and a novel surrogate function, we give the first algorithm whose sample complexity only scales $\sum_{i=1}^mA_i$ where $A_i$ is the action size of the $i$-th player and $m$ is the number of players. In sharp contrast, the sample complexity of methods based on the point-wise bonus would scale with the size of the joint action space $Π_{i=1}^m A_i$ due to the curse of multiagents. Lastly, all of our algorithms can naturally take a pre-specified strategy class $Π$ as input and output a strategy that is close to the best strategy in $Π$. In this setting, the sample complexity only scales with $\log |Π|$ instead of $\sum_{i=1}^mA_i$.