AI, MA | Jul 17, 2021

Communicating via Markov Decision Processes

Samuel Sokota, Christian Schroeder de Witt, Maximilian Igl, Luisa Zintgraf, Philip Torr, Martin Strohmeier, J. Zico Kolter, Shimon Whiteson, Jakob Foerster

arXiv:2107.08295v2 | 13.815 citations | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses a problem in decentralized control settings where cheap talk is unavailable, offering a practical method for communicating via MDP trajectories. It appears incremental because it builds on existing maximum entropy reinforcement learning and minimum entropy coupling techniques.

The paper tackles the problem of communicating information through Markov decision process trajectories, a setting it calls a Markov coding game (MCG), and proposes a method called MEME that balances communication against its cost. The results show that MEME outperforms a strong baseline on small MCGs and performs strongly on very large ones: it losslessly communicates binary images via Cartpole and Pong trajectories while achieving maximal or near-maximal expected returns.

We consider the problem of communicating exogenous information by means of Markov decision process trajectories. This setting, which we call a Markov coding game (MCG), generalizes both source coding and a large class of referential games. MCGs also isolate a problem that is important in decentralized control settings in which cheap-talk is not available -- namely, they require balancing communication with the associated cost of communicating. We contribute a theoretically grounded approach to MCGs based on maximum entropy reinforcement learning and minimum entropy coupling that we call MEME. Due to recent breakthroughs in approximation algorithms for minimum entropy coupling, MEME is not merely a theoretical algorithm, but can be applied to practical settings. Empirically, we show both that MEME is able to outperform a strong baseline on small MCGs and that MEME is able to achieve strong performance on extremely large MCGs. To the latter point, we demonstrate that MEME is able to losslessly communicate binary images via trajectories of Cartpole and Pong, while simultaneously achieving the maximal or near maximal expected returns, and that it is even capable of performing well in the presence of actuator noise.
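
To make the minimum entropy coupling ingredient more concrete, the following is a minimal sketch of a common greedy heuristic for coupling two marginal distributions into a low-entropy joint distribution. It is not the authors' implementation and not necessarily the approximation algorithm the paper relies on; the names `greedy_coupling` and `entropy`, and the example marginals, are hypothetical and chosen only for illustration.

```python
import math


def greedy_coupling(p, q, tol=1e-12):
    """Greedy heuristic for a low-entropy coupling of marginals p and q.

    Repeatedly pairs the largest remaining mass in p with the largest
    remaining mass in q. Illustrative only; assumes p and q each sum to 1.
    """
    p, q = list(p), list(q)
    joint = [[0.0] * len(q) for _ in p]
    while True:
        i = max(range(len(p)), key=p.__getitem__)  # largest remaining row mass
        j = max(range(len(q)), key=q.__getitem__)  # largest remaining column mass
        mass = min(p[i], q[j])
        if mass <= tol:  # nothing left to assign
            break
        joint[i][j] += mass
        p[i] -= mass
        q[j] -= mass
    return joint


def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(x * math.log2(x) for x in probs if x > 0)


# Hypothetical example: a uniform two-message source coupled with a
# three-action distribution.
message_dist = [0.5, 0.5]
action_dist = [0.5, 0.25, 0.25]
joint = greedy_coupling(message_dist, action_dist)
joint_entropy = entropy([x for row in joint for x in row])
```

In this toy example each action ends up paired with exactly one message, so observing the action identifies the message. Roughly speaking, a lower joint entropy means the two variables share more information, which is why a coupling of this kind is useful for encoding messages in actions.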

