LG AI CR ML | Nov 1, 2022

Adversarial Policies Beat Superhuman Go AIs

Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell

Berkeley

arXiv:2211.00241v4 | 19.237 citations | h-index: 166 | Has Code

Originality: Highly original

AI Analysis

This work reveals surprising failure modes in superhuman AI systems, a critical problem for AI safety and robustness, demonstrated here in the domain of game-playing.

The researchers attacked the superhuman Go AI KataGo by training adversarial policies that trick it into making blunders, achieving over 97% win rates, and found the vulnerability persists even with defensive training.

We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is comprehensible to the extent that human experts can implement it without algorithmic assistance to consistently beat superhuman AIs. The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against our attack. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at https://goattack.far.ai/.
