LG, MA · Nov 27, 2025

High entropy leads to symmetry equivariant policies in Dec-POMDPs

arXiv:2511.22581v2 · 1 citation
Originality: Highly original
AI Analysis

The paper addresses policy incompatibility between independently trained agents in multi-agent systems, a problem relevant to both researchers and practitioners. The contribution is incremental and comes with acknowledged limitations.

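The paper proves that sufficiently high entropy regularization in Dec-POMDPs makes policy gradient ascent with tabular softmax parametrization converge to a single symmetry-equivariant joint policy, so independently trained policies are compatible with one another. Empirically, higher entropy coefficients improve cross-play returns, and the approach achieves a new SOTA in inter-seed cross-play in Hanabi.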

We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that policy gradient ascent with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant w.r.t. all symmetries of the Dec-POMDP. In particular, policies coming from different random seeds will be fully compatible, in that their cross-play returns are equal to their self-play returns. Through extensive empirical evaluation of independent PPO in the Hanabi, Overcooked, and Yokai environments, we find that the entropy coefficient has a massive influence on the cross-play returns between independently trained policies, and that the drop in self-play returns coming from increased entropy regularization can often be counteracted by greedifying the learned policies after training. In Hanabi we achieve a new SOTA in inter-seed cross-play this way. Despite clear limitations of this recipe, which we point out, both our theoretical and empirical results indicate that during hyperparameter sweeps in Dec-POMDPs, one should consider far higher entropy coefficients than is typically done.
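
The convergence claim can be illustrated on a toy problem. The sketch below is not the paper's experimental setup (which uses independent PPO in Hanabi, Overcooked, and Yokai); it assumes a hypothetical one-shot matching game in which two agents each pick one of three actions and receive reward 1 only if the actions match. Any relabeling of the actions is a symmetry of this game. Each agent runs exact gradient ascent on its expected return plus an entropy bonus weighted by `alpha`, with a tabular softmax policy. All names (`train`, `self_and_cross_play`, `N_ACTIONS`) and hyperparameters are illustrative assumptions.

```python
import math
import random

N_ACTIONS = 3  # assumption: toy game, reward 1 if both agents pick the same action, else 0


def log_softmax(logits):
    """Numerically stable log-probabilities of a tabular softmax policy."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def regularized_grad(theta_self, theta_other, alpha):
    """Exact gradient of  pi_self . pi_other + alpha * H(pi_self)  w.r.t. theta_self."""
    log_p = log_softmax(theta_self)
    p = [math.exp(lp) for lp in log_p]
    q = [math.exp(lq) for lq in log_softmax(theta_other)]
    ret = sum(pi * qi for pi, qi in zip(p, q))        # expected return
    ent = -sum(pi * lp for pi, lp in zip(p, log_p))   # policy entropy H(pi_self)
    # Softmax chain rule: return term is p_i * (q_i - ret),
    # entropy term is -p_i * (log p_i + H).
    return [pi * ((qi - ret) - alpha * (lp + ent))
            for pi, qi, lp in zip(p, q, log_p)]


def train(seed, alpha, steps=3000, lr=0.5):
    """Simultaneous gradient ascent for both agents from a random initialization."""
    rng = random.Random(seed)
    thetas = [[rng.gauss(0.0, 1.0) for _ in range(N_ACTIONS)] for _ in range(2)]
    for _ in range(steps):
        g0 = regularized_grad(thetas[0], thetas[1], alpha)
        g1 = regularized_grad(thetas[1], thetas[0], alpha)
        thetas[0] = [t + lr * g for t, g in zip(thetas[0], g0)]
        thetas[1] = [t + lr * g for t, g in zip(thetas[1], g1)]
    return [[math.exp(lp) for lp in log_softmax(t)] for t in thetas]


def expected_return(policy_a, policy_b):
    """Probability that the two agents pick the same action."""
    return sum(a * b for a, b in zip(policy_a, policy_b))


def self_and_cross_play(alpha, seeds=(0, 1, 2, 3)):
    """Mean self-play return, and mean return when pairing agents across seeds."""
    runs = [train(s, alpha) for s in seeds]
    self_play = [expected_return(r[0], r[1]) for r in runs]
    cross_play = [expected_return(a[0], b[1])
                  for a in runs for b in runs if a is not b]
    return sum(self_play) / len(self_play), sum(cross_play) / len(cross_play)


if __name__ == "__main__":
    for alpha in (0.01, 1.0):
        sp, xp = self_and_cross_play(alpha)
        print(f"alpha={alpha}: self-play={sp:.3f}, cross-play={xp:.3f}")
```

With `alpha=1.0`, every seed ends at the uniform policy, which is the only policy in this game that is unchanged by relabeling actions, so cross-play and self-play returns coincide at about 1/3. With a small coefficient, different seeds are free to settle on different matching actions, so cross-play can fall below self-play. The toy game also hints at a limitation of the recipe: a uniform policy has no unique greedy action, so greedifying it would not recover coordination here.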
