Policy Gradient with Tree Search: Avoiding Local Optimas through Lookahead
This addresses a fundamental challenge in reinforcement learning for practitioners dealing with complex environments, though it is an incremental improvement over existing methods.
The paper tackles the problem of policy gradient methods converging to suboptimal local optima in reinforcement learning by integrating an m-step lookahead tree search, resulting in improved worst-case performance and empirical success in escaping local traps across diverse environments.
Classical policy gradient (PG) methods in reinforcement learning frequently converge to suboptimal local optima, a challenge exacerbated in large or complex environments. This work investigates Policy Gradient with Tree Search (PGTS), an approach that integrates an $m$-step lookahead mechanism to enhance policy optimization. We provide theoretical analysis demonstrating that increasing the tree search depth $m$-monotonically reduces the set of undesirable stationary points and, consequently, improves the worst-case performance of any resulting stationary policy. Critically, our analysis accommodates practical scenarios where policy updates are restricted to states visited by the current policy, rather than requiring updates across the entire state space. Empirical evaluations on diverse MDP structures, including Ladder, Tightrope, and Gridworld environments, illustrate PGTS's ability to exhibit "farsightedness," navigate challenging reward landscapes, escape local traps where standard PG fails, and achieve superior solutions.