LG · Mar 28

Scalable Maximum Entropy Population Synthesis via Persistent Contrastive Divergence

arXiv:2603.27312 · 23.6 · h-index: 18
Predicted impact: top 80% in LG · last 90 days · Originality: Incremental advance
AI Analysis

For practitioners of agent-based modeling and urban simulation, this work provides a scalable method to generate synthetic populations from aggregate data without access to microdata, overcoming a key computational limitation.

The paper tackles the computational bottleneck of maximum entropy population synthesis: exact expectation computation over a tuple space that becomes infeasibly large beyond roughly 20 categorical attributes. The proposed GibbsPCDSolver uses persistent contrastive divergence to keep mean relative error between 0.010 and 0.018 as the number of attributes K grows from 12 to 50, with runtime scaling linearly in K rather than with the exponentially growing size of the tuple space. On the Syn-ISTAT benchmark it produces populations with effective sample size equal to the pool size N, an 86.8× diversity advantage over generalized raking.
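The two metrics quoted above can be made concrete with a small sketch. It assumes that effective sample size follows Kish's formula for weighted samples and that mean relative error is the average of |estimate - target| / target over all constraints; neither definition is stated on this page, and the function names are hypothetical.

```python
import numpy as np

def kish_ess(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2).

    Assumed definition. Equal weights give exactly N, while a few
    dominant weights push the value far below N.
    """
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / np.sum(w ** 2))

def mean_relative_error(estimates, targets, eps=1e-12):
    """Mean of |estimate - target| / target over all constraints (assumed definition)."""
    est = np.ravel(np.asarray(estimates, dtype=float))
    tgt = np.ravel(np.asarray(targets, dtype=float))
    # eps guards against division by zero for empty target cells.
    return float(np.mean(np.abs(est - tgt) / np.maximum(tgt, eps)))
```

Under this reading, an unweighted pool of N individuals has effective sample size N by construction, whereas a reweighted sample such as one produced by raking can concentrate its mass on a small number of records.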

Maximum entropy (MaxEnt) modelling provides a principled framework for generating synthetic populations from aggregate census data, without access to individual-level microdata. The bottleneck of existing approaches is exact expectation computation, which requires summing over the full tuple space $\mathcal{X}$ and becomes infeasible for more than $K \approx 20$ categorical attributes. We propose GibbsPCDSolver, a stochastic replacement for this computation based on Persistent Contrastive Divergence (PCD): a persistent pool of $N$ synthetic individuals is updated by Gibbs sweeps at each gradient step, providing a stochastic approximation of the model expectations without ever materialising $\mathcal{X}$. We validate the approach on controlled benchmarks and on Syn-ISTAT, a $K = 15$ Italian demographic benchmark with analytically exact marginal targets derived from ISTAT-inspired conditional probability tables. Scaling experiments across $K \in \{12, 20, 30, 40, 50\}$ confirm that GibbsPCDSolver maintains $\mathrm{MRE} \in [0.010, 0.018]$ while $|\mathcal{X}|$ grows by eighteen orders of magnitude, with runtime scaling as $O(K)$ rather than $O(|\mathcal{X}|)$. On Syn-ISTAT, GibbsPCDSolver reaches $\mathrm{MRE} = 0.03$ on training constraints and, crucially, produces populations with effective sample size $N_{\mathrm{eff}} = N$ versus $N_{\mathrm{eff}} \approx 0.012\,N$ for generalised raking, an $86.8\times$ diversity advantage that is essential for agent-based urban simulations.
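The persistent-pool idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' GibbsPCDSolver: it assumes a MaxEnt model with univariate and selected pairwise marginal constraints, plain gradient ascent with a fixed step size, and one Gibbs sweep per step. All function names, parameters, and defaults are hypothetical.

```python
import numpy as np

def gibbs_sweep(pool, theta, W, rng):
    """Resample every attribute of every individual once, in place."""
    n, K = pool.shape
    for k in range(K):
        # Conditional logits for attribute k: unary term plus pairwise terms.
        logits = np.tile(theta[k], (n, 1))
        for (i, j), w in W.items():
            if i == k:
                logits += w[:, pool[:, j]].T
            elif j == k:
                logits += w[pool[:, i], :]
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        # Inverse-CDF sampling, vectorised over the whole pool.
        u = rng.random((n, 1))
        idx = (p.cumsum(axis=1) < u).sum(axis=1)
        pool[:, k] = np.minimum(idx, p.shape[1] - 1)

def pool_marginals(pool, cards, pairs):
    """Empirical univariate and pairwise marginals of the pool."""
    n = pool.shape[0]
    uni = [np.bincount(pool[:, k], minlength=c) / n for k, c in enumerate(cards)]
    pair = {}
    for (i, j) in pairs:
        m = np.zeros((cards[i], cards[j]))
        np.add.at(m, (pool[:, i], pool[:, j]), 1.0)
        pair[(i, j)] = m / n
    return uni, pair

def fit_pcd(target_uni, target_pair, cards, n_pool=2000, steps=300, lr=0.5, seed=0):
    """Fit MaxEnt parameters by PCD; the tuple space is never enumerated."""
    rng = np.random.default_rng(seed)
    theta = [np.zeros(c) for c in cards]
    W = {ij: np.zeros((cards[ij[0]], cards[ij[1]])) for ij in target_pair}
    # The persistent pool is initialised once and carried across gradient steps.
    pool = np.column_stack([rng.integers(0, c, n_pool) for c in cards])
    for _ in range(steps):
        gibbs_sweep(pool, theta, W, rng)
        uni, pair = pool_marginals(pool, cards, list(target_pair))
        # Log-likelihood gradient: target marginal minus pool estimate.
        for k in range(len(cards)):
            theta[k] += lr * (target_uni[k] - uni[k])
        for ij in W:
            W[ij] += lr * (target_pair[ij] - pair[ij])
    return pool, theta, W
```

In this sketch the cost of one sweep grows with the number of attributes and constraints, not with $|\mathcal{X}|$, and the returned pool is itself an unweighted synthetic population of `n_pool` individuals.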

