Model-free policy gradient for discrete-time mean-field control
This work fills a gap in policy-based methods for MFC, a setting relevant to applications with large populations, though the contribution may appear incremental because it builds on the existing value-based MFC literature.
The authors address model-free policy learning for discrete-time mean-field control (MFC). They introduce a perturbation scheme for estimating policy gradients and build on it to develop the MF-REINFORCE algorithm, for which they prove bounds on the bias and mean-squared error. Numerical experiments show that the method is effective.
We study model-free policy learning for discrete-time mean-field control (MFC) problems with finite state space and compact action space. In contrast to the extensive literature on value-based methods for MFC, policy-based approaches remain largely unexplored due to the intrinsic dependence of transition kernels and rewards on the evolving population state distribution, which prevents the direct use of likelihood-ratio estimators of policy gradients from classical single-agent reinforcement learning. We introduce a novel perturbation scheme on the state-distribution flow and prove that the gradient of the resulting perturbed value function converges to the true policy gradient as the perturbation magnitude vanishes. This construction yields a fully model-free estimator based solely on simulated trajectories and an auxiliary estimate of the sensitivity of the state distribution. Building on this framework, we develop MF-REINFORCE, a model-free policy gradient algorithm for MFC, and establish explicit quantitative bounds on its bias and mean-squared error. Numerical experiments on representative mean-field control tasks demonstrate the effectiveness of the proposed approach.
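To make the obstacle described above concrete, the following is a minimal sketch on a hypothetical two-state, two-action toy problem. It is not MF-REINFORCE and does not implement the paper's perturbation scheme. The dynamics in `kernel`, the reward in `reward`, the softmax policy, the finite action set, and all constants are assumptions made purely for illustration. The sketch compares two finite-difference gradients: one of the exact MFC value, where the kernel and reward follow the distribution flow induced by the policy, and one of a "frozen-flow" value, where the flow is held fixed as if it were exogenous. The frozen-flow gradient is the quantity a classical single-agent likelihood-ratio estimator would target, and it omits the sensitivity of the flow to the policy parameters.

```python
import numpy as np

GAMMA = 0.9  # discount factor (assumed for this toy example)

def softmax_policy(theta):
    """Row-wise softmax: theta[s, a] logits -> pi[s, a]."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kernel(mu):
    """Hypothetical kernel P[s, a, s'] that depends on the population distribution mu."""
    P = np.zeros((2, 2, 2))
    for s in range(2):
        other = 1 - s
        P[s, 0, s] = 1.0                         # action 0: stay put
        P[s, 1, other] = 1.0 - 0.5 * mu[other]   # action 1: move, harder if target is crowded
        P[s, 1, s] = 0.5 * mu[other]
    return P

def reward(mu):
    """Hypothetical reward r[s, a]: crowd aversion plus a small cost for moving."""
    return -mu[:, None] - 0.1 * np.array([0.0, 1.0])[None, :]

def step(dist, pi, P):
    """Next distribution: sum over s, a of dist[s] * pi[s, a] * P[s, a, s']."""
    return np.einsum("s,sa,sat->t", dist, pi, P)

def flow(theta, mu0, T):
    """Distribution flow mu_0, ..., mu_{T-1} induced by the policy."""
    pi, mu, out = softmax_policy(theta), mu0.copy(), []
    for _ in range(T):
        out.append(mu)
        mu = step(mu, pi, kernel(mu))
    return out

def mfc_value(theta, mu0, T):
    """Exact MFC value: kernel and reward follow the flow induced by theta."""
    pi, mu, J = softmax_policy(theta), mu0.copy(), 0.0
    for t in range(T):
        J += GAMMA**t * np.sum(mu[:, None] * pi * reward(mu))
        mu = step(mu, pi, kernel(mu))
    return J

def frozen_value(theta, frozen_flow, mu0):
    """Single-agent value when kernel and reward use a fixed, precomputed flow."""
    pi, nu, J = softmax_policy(theta), mu0.copy(), 0.0
    for t, mu in enumerate(frozen_flow):
        J += GAMMA**t * np.sum(nu[:, None] * pi * reward(mu))
        nu = step(nu, pi, kernel(mu))
    return J

def fd_grad(f, theta, eps=1e-5):
    """Central finite-difference gradient (needs the model; used only as a reference)."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        d = np.zeros_like(theta)
        d[idx] = eps
        g[idx] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

theta0 = np.array([[0.5, -0.5], [-0.3, 0.3]])
mu0 = np.array([0.8, 0.2])
T = 5

fl = flow(theta0, mu0, T)
g_true = fd_grad(lambda th: mfc_value(th, mu0, T), theta0)
g_frozen = fd_grad(lambda th: frozen_value(th, fl, mu0), theta0)
print("gradient of exact MFC value:\n", g_true)
print("gradient with frozen flow:\n", g_frozen)
```

The two value functions agree at `theta0`, but their gradients differ because the frozen-flow version ignores how the distribution flow itself responds to the policy parameters. Both gradients here rely on full knowledge of the toy model, so the sketch only illustrates the missing term, not a model-free way to estimate it.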