LGJun 7, 2023

Policy-Based Self-Competition for Planning Problems

arXiv:2306.04403v15 citationsh-index: 8
Originality Incremental advance
AI Analysis

This addresses stagnation in single-player planning for combinatorial optimization, though it is incremental as it builds on existing self-competition and GAZ methods.

The paper tackled the problem of AlphaZero-type algorithms stagnating in single-player tasks by introducing GAZ 'Play-to-Plan' (GAZ PTP), which incorporates historical policies into planning against past strategies, resulting in consistent outperformance of baseline GAZ variants with only half the simulation budget in combinatorial optimization problems like the Traveling Salesman and Job-Shop Scheduling.

AlphaZero-type algorithms may stop improving on single-player tasks in case the value network guiding the tree search is unable to approximate the outcome of an episode sufficiently well. One technique to address this problem is transforming the single-player task through self-competition. The main idea is to compute a scalar baseline from the agent's historical performances and to reshape an episode's reward into a binary output, indicating whether the baseline has been exceeded or not. However, this baseline only carries limited information for the agent about strategies how to improve. We leverage the idea of self-competition and directly incorporate a historical policy into the planning process instead of its scalar performance. Based on the recently introduced Gumbel AlphaZero (GAZ), we propose our algorithm GAZ 'Play-to-Plan' (GAZ PTP), in which the agent learns to find strong trajectories by planning against possible strategies of its past self. We show the effectiveness of our approach in two well-known combinatorial optimization problems, the Traveling Salesman Problem and the Job-Shop Scheduling Problem. With only half of the simulation budget for search, GAZ PTP consistently outperforms all selected single-player variants of GAZ.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes