LG AI GT MAOct 7, 2021

Online Markov Decision Processes with Non-oblivious Strategic Adversary

Le Cong Dinh, David Henry Mguni, Long Tran-Thanh, Jun Wang, Yaodong Yang

arXiv:2110.03604v33.17 citations

Originality Incremental advance

AI Analysis

This work addresses strategic decision-making in online environments for applications like game theory and AI, with incremental improvements in regret bounds and convergence results.

The paper tackles the problem of Online Markov Decision Processes with a non-oblivious strategic adversary, showing that an existing algorithm achieves a policy regret bound of O(√(T log(L)) + τ²√(T log(|A|))) and proposing a new algorithm, MDP-OOE, that improves this to O(√(T log(L)) + τ²√(T k log(k))) for games with small Nash equilibrium support, while also introducing an algorithm for last-round convergence to Nash equilibrium.

We study a novel setting in Online Markov Decision Processes (OMDPs) where the loss function is chosen by a non-oblivious strategic adversary who follows a no-external regret algorithm. In this setting, we first demonstrate that MDP-Expert, an existing algorithm that works well with oblivious adversaries can still apply and achieve a policy regret bound of $\mathcal{O}(\sqrt{T \log(L)}+τ^2\sqrt{ T \log(|A|)})$ where $L$ is the size of adversary's pure strategy set and $|A|$ denotes the size of agent's action space. Considering real-world games where the support size of a NE is small, we further propose a new algorithm: MDP-Online Oracle Expert (MDP-OOE), that achieves a policy regret bound of $\mathcal{O}(\sqrt{T\log(L)}+τ^2\sqrt{ T k \log(k)})$ where $k$ depends only on the support size of the NE. MDP-OOE leverages the key benefit of Double Oracle in game theory and thus can solve games with prohibitively large action space. Finally, to better understand the learning dynamics of no-regret methods, under the same setting of no-external regret adversary in OMDPs, we introduce an algorithm that achieves last-round convergence result to a NE. To our best knowledge, this is first work leading to the last iteration result in OMDPs.

View on arXiv PDF

Similar