LG, AI, CR, ML · Nov 1, 2022

Adversarial Policies Beat Superhuman Go AIs

Berkeley
arXiv:2211.00241v4 · 35 citations · h-index: 166
Originality: Highly original
AI Analysis

This work reveals surprising failure modes in a superhuman game-playing AI system, a critical concern for AI safety and robustness.

The researchers attacked the superhuman Go AI KataGo by training adversarial policies that trick it into making serious blunders, achieving a win rate above 97%. They found that the vulnerability persists even in KataGo agents adversarially trained to defend against the attack.

We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is comprehensible to the extent that human experts can implement it without algorithmic assistance to consistently beat superhuman AIs. The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against our attack. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at https://goattack.far.ai/.

Code Implementations: 3 repos
