LGAIMay 7, 2024

Super-Exponential Regret for UCT, AlphaGo and Variants

arXiv:2405.04407v23 citationsh-index: 22
Originality Synthesis-oriented
AI Analysis

This work is incremental, as it improves and extends existing lower bound proofs for regret in Monte Carlo tree search algorithms, which is relevant for researchers in reinforcement learning and game AI.

The paper tackles the problem of proving super-exponential regret for UCT and AlphaGo variants on the D-chain environment, fixing an oversight in prior proofs and extending the results to show that these algorithms can have extremely high regret, specifically exp2(exp2(D - O(log D))) regret.

We improve the proofs of the lower bounds of Coquelin and Munos (2007) that demonstrate that UCT can have $\exp(\dots\exp(1)\dots)$ regret (with $Ω(D)$ exp terms) on the $D$-chain environment, and that a `polynomial' UCT variant has $\exp_2(\exp_2(D - O(\log D)))$ regret on the same environment -- the original proofs contain an oversight for rewards bounded in $[0, 1]$, which we fix in the present draft. We also adapt the proofs to AlphaGo's MCTS and its descendants (e.g., AlphaZero, Leela Zero) to also show $\exp_2(\exp_2(D - O(\log D)))$ regret.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes