Minimax-Optimal Policy Regret in Partially Observable Markov Games
For researchers in reinforcement learning and game theory, this provides the first minimax-optimal policy regret bounds in POMGs, addressing a fundamental challenge of learning under partial observability and strategic adversaries.
The paper tackles policy regret minimization in partially observable Markov games against strategic adversaries. It proves that an epoch-based optimistic maximum-likelihood algorithm achieves Õ(√T) policy regret, with matching lower bounds.
We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavior depends on the learner's strategy, making standard regret notions inadequate. We prove that an epoch-based optimistic maximum-likelihood algorithm achieves $\tilde{O}(\sqrt{T})$ policy regret for fixed problem parameters, with explicit dependence on the horizon, adversary memory, confidence radius, and the aggregate Eluder dimension of the observable-operator class. The algorithm selects one policy per geometrically growing epoch using confidence sets built cumulatively from past data, which keeps the cost of comparing adversary responses across policies logarithmic in $T$. We also prove a lower bound matching the $\sqrt{T}$ and aggregate-Eluder-dimension dependence, up to problem-dependent and logarithmic factors. Finally, we extend the framework to horizon-adaptive guarantees and adversaries with geometric fading memory.