LG | AI | Dec 25, 2025

Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search

arXiv:2512.21648v1 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in MCTS-based reinforcement learning for long-horizon reasoning tasks: prior-based tree policies have so far been tied to the empirically derived PUCT. It offers an incremental improvement to MCTS algorithms.

The paper extends prior-based tree policies in Monte Carlo Tree Search (MCTS) beyond the empirically derived PUCT. It introduces Inverse-RPO, a method for systematically deriving such policies from any prior-free UCB, and applies it to UCB-V to obtain two variance-aware prior-based policies. In the reported experiments, these policies outperform PUCT across multiple benchmarks without extra computational cost.

Monte Carlo Tree Search (MCTS) has profoundly influenced reinforcement learning (RL) by integrating planning and learning in tasks requiring long-horizon reasoning, exemplified by the AlphaZero family of algorithms. Central to MCTS is the search strategy, governed by a tree policy based on an upper confidence bound (UCB) applied to trees (UCT). A key factor in the success of AlphaZero is the introduction of a prior term in the UCB1-based tree policy PUCT, which improves exploration efficiency and thus accelerates training. While many alternative UCBs with stronger theoretical guarantees than UCB1 exist, extending them to prior-based UCTs has been challenging, since PUCT was derived empirically rather than from first principles. Recent work retrospectively justified PUCT by framing MCTS as a regularized policy optimization (RPO) problem. Building on this perspective, we introduce Inverse-RPO, a general methodology that systematically derives prior-based UCTs from any prior-free UCB. Applying this method to the variance-aware UCB-V, we obtain two new prior-based tree policies that incorporate variance estimates into the search. Experiments indicate that these variance-aware prior-based UCTs outperform PUCT across multiple benchmarks without incurring additional computational cost. We also provide an extension of the mctx library supporting variance-aware UCTs, showing that the required code changes are minimal; the extension is intended to facilitate further research on principled prior-based UCTs. Code: github.com/Max-We/inverse-rpo.
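
To make the two starting points named in the abstract concrete, the sketch below scores child actions with the AlphaZero-style PUCT rule and, separately, with the prior-free UCB-V index in its commonly cited form. It is background only: it does not reproduce the paper's two derived variance-aware prior-based policies, which are not spelled out in this summary. The function names and the constants `c_puct`, `zeta`, `c`, and `reward_range` are illustrative assumptions, and nothing here reflects the mctx API.

```python
import math


def puct_scores(q, prior, visits, c_puct=1.25):
    """AlphaZero-style PUCT: Q(a) + c_puct * P(a) * sqrt(N) / (1 + n(a)).

    q, prior, visits are per-action lists; c_puct is an assumed constant.
    """
    total = sum(visits)
    return [
        q_a + c_puct * p_a * math.sqrt(total) / (1 + n_a)
        for q_a, p_a, n_a in zip(q, prior, visits)
    ]


def ucbv_scores(mean, var, visits, reward_range=1.0, zeta=1.2, c=1.0):
    """Prior-free UCB-V index in its commonly cited form:

        mean + sqrt(2 * var * zeta * log(t) / n) + c * 3 * b * zeta * log(t) / n

    where b is the reward range and t the total visit count. zeta and c
    are assumed tuning constants. Unvisited actions get an infinite score.
    """
    t = sum(visits)
    scores = []
    for m, v, n in zip(mean, var, visits):
        if n == 0:
            scores.append(math.inf)
            continue
        log_t = math.log(t)
        bonus = math.sqrt(2 * v * zeta * log_t / n)
        bonus += c * 3 * reward_range * zeta * log_t / n
        scores.append(m + bonus)
    return scores


if __name__ == "__main__":
    # Equal values and visits: PUCT prefers the action with the larger prior.
    print(puct_scores([0.5, 0.5], [0.8, 0.2], [1, 1]))
    # Equal means and visits: UCB-V prefers the action with higher variance.
    print(ucbv_scores([0.5, 0.5], [0.25, 0.0], [10, 10]))
```

PUCT's exploration bonus depends only on the prior and visit counts, whereas UCB-V's bonus grows with the empirical variance of the observed returns. Combining a prior with such a variance term in a principled way is the gap the abstract says Inverse-RPO addresses.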

