AI · LG · MA · Jan 27, 2023

Policy-Value Alignment and Robustness in Search-based Multi-Agent Learning

arXiv:2301.11857v2 · h-index: 69
Originality: Incremental advance
AI Analysis

This paper addresses robustness issues that limit the real-world deployment of AI systems, though the advance is incremental because it builds directly on AlphaZero.

The paper tackles the brittleness of search-based multi-agent learning systems such as AlphaZero by identifying two phenomena, policy-value misalignment and value inconsistency. It introduces VISA-VIS, which reduces policy-value misalignment by up to 76%, value generalization error by up to 50%, and average value error by up to 55%.

Large-scale AI systems that combine search and learning have reached super-human levels of performance in game-playing, but have also been shown to fail in surprising ways. The brittleness of such models limits their efficacy and trustworthiness in real-world deployments. In this work, we systematically study one such algorithm, AlphaZero, and identify two phenomena related to the nature of exploration. First, we find evidence of policy-value misalignment -- for many states, AlphaZero's policy and value predictions contradict each other, revealing a tension between accurate move-selection and value estimation in AlphaZero's objective. Further, we find inconsistency within AlphaZero's value function, which causes it to generalize poorly, despite its policy playing an optimal strategy. From these insights we derive VISA-VIS: a novel method that improves policy-value alignment and value robustness in AlphaZero. Experimentally, we show that our method reduces policy-value misalignment by up to 76%, reduces value generalization error by up to 50%, and reduces average value error by up to 55%.
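One way to make the notion of policy-value misalignment concrete is sketched below. This is a minimal illustration, not the paper's measurement procedure: it assumes a state counts as misaligned when the policy's most probable move is strictly worse, under a one-step lookahead with the value function, than the best available move. The names `legal_actions`, `policy`, `value`, and `step` are hypothetical callables supplied by the caller, and the sign flip assumes a two-player zero-sum game in which `value` is reported from the perspective of the player to move.

```python
from typing import Callable, Dict, Hashable, Iterable, List

State = Hashable
Action = Hashable


def is_misaligned(
    state: State,
    legal_actions: Callable[[State], Iterable[Action]],
    policy: Callable[[State], Dict[Action, float]],
    value: Callable[[State], float],
    step: Callable[[State, Action], State],
) -> bool:
    """Return True if the policy's top move is not a value-optimal move.

    Assumption (not from the paper): value(s) is from the viewpoint of the
    player to move in s, so a successor's value is negated to score the move
    for the current player.
    """
    actions: List[Action] = list(legal_actions(state))
    probs = policy(state)

    # Move the policy head prefers.
    policy_action = max(actions, key=lambda a: probs.get(a, 0.0))

    # One-step lookahead: score each move by the negated successor value.
    q = {a: -value(step(state, a)) for a in actions}

    # Strict comparison so that ties in value are not counted as misalignment.
    return q[policy_action] < max(q.values())


def misalignment_rate(
    states: Iterable[State],
    legal_actions: Callable[[State], Iterable[Action]],
    policy: Callable[[State], Dict[Action, float]],
    value: Callable[[State], float],
    step: Callable[[State, Action], State],
) -> float:
    """Fraction of the given states on which policy and value disagree."""
    states = list(states)
    if not states:
        return 0.0
    flagged = sum(
        is_misaligned(s, legal_actions, policy, value, step) for s in states
    )
    return flagged / len(states)
```

A percentage reduction like the one quoted in the abstract would then correspond to comparing this kind of rate before and after a change to training, though the exact metric the authors report may be defined differently.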
