LG · AI · May 9

Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

arXiv:2605.09157
AI Analysis

For researchers in continuous-action reinforcement learning, this work provides a practical method to leverage mixture policies, which were previously hindered by high-variance gradient estimation.

The paper investigates why mixture policies are rarely used in state-of-the-art actor-critic algorithms despite theoretical advantages, identifies the lack of a low-variance reparameterization trick as a core issue, and proposes a marginalized reparameterization (MRP) estimator that reduces variance. Experiments show MRP mixture policies match or outperform Gaussian policies across several continuous control benchmarks.

Mixture policies theoretically offer greater flexibility than unimodal policies in continuous-action reinforcement learning, but the practical benefits of this added complexity remain elusive. Mixture policies are notably absent from most state-of-the-art algorithms, raising a fundamental question: is the added representational overhead useful? We show that increased flexibility can theoretically enhance solution quality and entropy robustness. Yet standard algorithms like SAC do not leverage these advantages. A core issue is the lack of a low-variance reparameterization trick for mixtures, a luxury Gaussian policies enjoy. We propose a marginalized reparameterization (MRP) estimator to address this, proving it offers lower variance than the standard likelihood-ratio (LR) approach. Our experiments across Gym MuJoCo, DeepMind Control Suite, and MetaWorld show that MRP mixture policies significantly outperform their LR counterparts and reach parity with (and sometimes surpass) Gaussian policies. In addition, we find several cases where MRP mixture policies exhibit clear empirical advantages. In this paper, we provide a clearer understanding of the trade-offs involved, elevating MRP mixture policies from theoretical curiosity to a practical tool.
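To make the variance argument concrete, here is a minimal toy sketch in plain Python. It is not the paper's estimator: it assumes one plausible reading of "marginalized reparameterization", namely summing over the discrete mixture component analytically and applying the usual Gaussian reparameterization inside each component. The mixture parameters, the quadratic stand-in for a critic `f`, and the function names `lr_sample`, `marginalized_rp_sample`, and `compare` are all hypothetical and chosen only for illustration. Both estimators target the gradient of E[f(a)] with respect to the first component's mean.

```python
import math
import random

# Hypothetical 1-D two-component Gaussian mixture (illustrative values only).
WEIGHTS = [0.3, 0.7]
MEANS = [-1.0, 2.0]
STDS = [0.5, 0.8]


def f(a):
    """Toy objective standing in for a critic value (an assumption)."""
    return a * a


def df(a):
    """Derivative of the toy objective with respect to the action."""
    return 2.0 * a


def normal_pdf(a, mean, std):
    z = (a - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def lr_sample(rng):
    """One likelihood-ratio sample of d E[f(a)] / d MEANS[0]."""
    k = rng.choices(range(len(WEIGHTS)), weights=WEIGHTS)[0]
    a = rng.gauss(MEANS[k], STDS[k])
    density = sum(w * normal_pdf(a, m, s)
                  for w, m, s in zip(WEIGHTS, MEANS, STDS))
    # Score function: derivative of log mixture density w.r.t. MEANS[0].
    score = (WEIGHTS[0] * normal_pdf(a, MEANS[0], STDS[0])
             * (a - MEANS[0]) / STDS[0] ** 2) / density
    return f(a) * score


def marginalized_rp_sample(rng):
    """One sample that sums over components and reparameterizes each one.

    Assumed form: sum_k w_k * f'(mu_k + sigma_k * eps) * d a_k / d MEANS[0].
    """
    eps = rng.gauss(0.0, 1.0)
    total = 0.0
    for k, (w, m, s) in enumerate(zip(WEIGHTS, MEANS, STDS)):
        a_k = m + s * eps
        da_dmu0 = 1.0 if k == 0 else 0.0  # only component 0 depends on MEANS[0]
        total += w * df(a_k) * da_dmu0
    return total


def mean_and_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v


def compare(n=20000, seed=0):
    """Return sample mean and variance of both estimators over n draws."""
    rng = random.Random(seed)
    lr_mean, lr_var = mean_and_var([lr_sample(rng) for _ in range(n)])
    rp_mean, rp_var = mean_and_var([marginalized_rp_sample(rng) for _ in range(n)])
    return {"lr_mean": lr_mean, "lr_var": lr_var,
            "mrp_mean": rp_mean, "mrp_var": rp_var}


if __name__ == "__main__":
    print(compare())
```

For this toy objective the exact gradient is 2 · 0.3 · (−1) = −0.6, so both sample means should land near that value, while the per-sample variance of the marginalized estimator should be much smaller than that of the LR estimator. This only illustrates the general phenomenon the abstract describes; the paper's actual MRP construction and its variance proof may differ in detail.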

