LG · ML · Dec 16, 2019

Self-Play Learning Without a Reward Metric

arXiv:1912.07557v1 · 4 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of designing a reward function for users of self-play algorithms in strategy games. The contribution appears incremental, since it builds directly on AlphaZero.

AlphaZero's self-play learning requires a quantitative reward function. The paper introduces a modification that requires only a total ordering over game outcomes, which eliminates the need for reward balancing. The authors report that the modified system learns optimal play in a comparable amount of time to AlphaZero on a sample game.

The AlphaZero algorithm for the learning of strategy games via self-play, which has produced superhuman ability in the games of Go, chess, and shogi, uses a quantitative reward function for game outcomes, requiring the users of the algorithm to explicitly balance different components of the reward against each other, such as the game winner and margin of victory. We present a modification to the AlphaZero algorithm that requires only a total ordering over game outcomes, obviating the need to perform any quantitative balancing of reward components. We demonstrate that this system learns optimal play in a comparable amount of time to AlphaZero on a sample game.
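To make the idea of "a total ordering over game outcomes" concrete, here is a minimal sketch in Python. It is not the paper's algorithm; the `Outcome` class, its fields, and the `compare` helper are hypothetical names chosen for illustration. It only shows how outcomes combining a winner and a margin of victory can be ranked lexicographically, with no numeric weights to balance the two components.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Outcome:
    """Hypothetical game outcome, seen from one player's side.

    order=True compares instances field by field, in declaration order,
    which gives a lexicographic total ordering.
    """
    won: bool    # compared first: any win outranks any loss
    margin: int  # compared second: signed score difference


def compare(a: Outcome, b: Outcome) -> int:
    """Return +1 if a is preferred to b, -1 if b is preferred, 0 if tied."""
    return (a > b) - (a < b)


outcomes = [
    Outcome(False, -3),
    Outcome(True, 1),
    Outcome(False, -10),
    Outcome(True, 7),
]

# Sort from worst to best using only the ordering; no reward weights appear.
ranked = sorted(outcomes)
```

A scalar reward would instead need a choice such as `w_win * won + w_margin * margin`, and the relative size of the two weights is exactly the balancing decision the abstract says the ordering-based approach avoids.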

