MLJan 27
Revisiting Incremental Stochastic Majorization-Minimization Algorithms with Applications to Mixture of ExpertsTrungKhang Tran, TrungTin Nguyen, Gersende Fort et al.
Processing high-volume, streaming data is increasingly common in modern statistics and machine learning, where batch-mode algorithms are often impractical because they require repeated passes over the full dataset. This has motivated incremental stochastic estimation methods, including the incremental stochastic Expectation-Maximization (EM) algorithm formulated via stochastic approximation. In this work, we revisit and analyze an incremental stochastic variant of the Majorization-Minimization (MM) algorithm, which generalizes incremental stochastic EM as a special case. Our approach relaxes key EM requirements, such as explicit latent-variable representations, enabling broader applicability and greater algorithmic flexibility. We establish theoretical guarantees for the incremental stochastic MM algorithm, proving consistency in the sense that the iterates converge to a stationary point characterized by a vanishing gradient of the objective. We demonstrate these advantages on a softmax-gated mixture of experts (MoE) regression problem, for which no stochastic EM algorithm is available. Empirically, our method consistently outperforms widely used stochastic optimizers, including stochastic gradient descent, root mean square propagation, adaptive moment estimation, and second-order clipped stochastic optimization. These results support the development of new incremental stochastic algorithms, given the central role of softmax-gated MoE architectures in contemporary deep neural networks for heterogeneous data modeling. Beyond synthetic experiments, we also validate practical effectiveness on two real-world datasets, including a bioinformatics study of dent maize genotypes under drought stress that integrates high-dimensional proteomics with ecophysiological traits, where incremental stochastic MM yields stable gains in predictive performance.
MLNov 18, 2024
The Statistical Accuracy of Neural Posterior and Likelihood EstimationDavid T. Frazier, Ryan Kelly, Christopher Drovandi et al.
Neural posterior estimation (NPE) and neural likelihood estimation (NLE) are machine learning approaches that provide accurate posterior, and likelihood, approximations in complex modeling scenarios, and in situations where conducting amortized inference is a necessity. While such methods have shown significant promise across a range of diverse scientific applications, the statistical accuracy of these methods is so far unexplored. In this manuscript, we give, for the first time, an in-depth exploration on the statistical behavior of NPE and NLE. We prove that these methods have similar theoretical guarantees to common statistical methods like approximate Bayesian computation (ABC) and Bayesian synthetic likelihood (BSL). While NPE and NLE methods are just as accurate as ABC and BSL, we prove that this accuracy can often be achieved at a vastly reduced computational cost, and will therefore deliver more attractive approximations than ABC and BSL in certain problems. We verify our results theoretically and in several examples from the literature.
MEMar 16, 2025
Simulation-based Bayesian inference under model misspecificationRyan P. Kelly, David J. Warne, David T. Frazier et al.
Simulation-based Bayesian inference (SBI) methods are widely used for parameter estimation in complex models where evaluating the likelihood is challenging but generating simulations is relatively straightforward. However, these methods commonly assume that the simulation model accurately reflects the true data-generating process, an assumption that is frequently violated in realistic scenarios. In this paper, we focus on the challenges faced by SBI methods under model misspecification. We consolidate recent research aimed at mitigating the effects of misspecification, highlighting three key strategies: i) robust summary statistics, ii) generalised Bayesian inference, and iii) error modelling and adjustment parameters. To illustrate both the vulnerabilities of popular SBI methods and the effectiveness of misspecification-robust alternatives, we present empirical results on an illustrative example.
MLApr 21, 2024
Preconditioned Neural Posterior Estimation for Likelihood-free InferenceXiaoyu Wang, Ryan P. Kelly, David J. Warne et al.
Simulation based inference (SBI) methods enable the estimation of posterior distributions when the likelihood function is intractable, but where model simulation is feasible. Popular neural approaches to SBI are the neural posterior estimator (NPE) and its sequential version (SNPE). These methods can outperform statistical SBI approaches such as approximate Bayesian computation (ABC), particularly for relatively small numbers of model simulations. However, we show in this paper that the NPE methods are not guaranteed to be highly accurate, even on problems with low dimension. In such settings the posterior cannot be accurately trained over the prior predictive space, and even the sequential extension remains sub-optimal. To overcome this, we propose preconditioned NPE (PNPE) and its sequential version (PSNPE), which uses a short run of ABC to effectively eliminate regions of parameter space that produce large discrepancy between simulations and data and allow the posterior emulator to be more accurately trained. We present comprehensive empirical evidence that this melding of neural and statistical SBI methods improves performance over a range of examples, including a motivating example involving a complex agent-based model applied to real tumour growth data.
MLMay 19, 2025
Model Selection for Gaussian-gated Gaussian Mixture of Experts Using Dendrograms of Mixing MeasuresTuan Thai, TrungTin Nguyen, Dat Do et al.
Mixture of Experts (MoE) models constitute a widely utilized class of ensemble learning approaches in statistics and machine learning, known for their flexibility and computational efficiency. They have become integral components in numerous state-of-the-art deep neural network architectures, particularly for analyzing heterogeneous data across diverse domains. Despite their practical success, the theoretical understanding of model selection, especially concerning the optimal number of mixture components or experts, remains limited and poses significant challenges. These challenges primarily stem from the inclusion of covariates in both the Gaussian gating functions and expert networks, which introduces intrinsic interactions governed by partial differential equations with respect to their parameters. In this paper, we revisit the concept of dendrograms of mixing measures and introduce a novel extension to Gaussian-gated Gaussian MoE models that enables consistent estimation of the true number of mixture components and achieves the pointwise optimal convergence rate for parameter estimation in overfitted scenarios. Notably, this approach circumvents the need to train and compare a range of models with varying numbers of components, thereby alleviating the computational burden, particularly in high-dimensional or deep neural network settings. Experimental results on synthetic data demonstrate the effectiveness of the proposed method in accurately recovering the number of experts. It outperforms common criteria such as the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood, while achieving optimal convergence rates for parameter estimation and accurately approximating the regression function.
MLOct 14, 2025
Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency without Model SweepsDo Tien Hai, Trung Nguyen Mai, TrungTin Nguyen et al.
We develop a unified statistical framework for softmax-gated Gaussian mixture of experts (SGMoE) that addresses three long-standing obstacles in parameter estimation and model selection: (i) non-identifiability of gating parameters up to common translations, (ii) intrinsic gate-expert interactions that induce coupled differential relations in the likelihood, and (iii) the tight numerator-denominator coupling in the softmax-induced conditional density. Our approach introduces Voronoi-type loss functions aligned with the gate-partition geometry and establishes finite-sample convergence rates for the maximum likelihood estimator (MLE). In over-specified models, we reveal a link between the MLE's convergence rate and the solvability of an associated system of polynomial equations characterizing near-nonidentifiable directions. For model selection, we adapt dendrograms of mixing measures to SGMoE, yielding a consistent, sweep-free selector of the number of experts that attains pointwise-optimal parameter rates under overfitting while avoiding multi-size training. Simulations on synthetic data corroborate the theory, accurately recovering the expert count and achieving the predicted rates for parameter estimation while closely approximating the regression function. Under model misspecification (e.g., $ε$-contamination), the dendrogram selection criterion is robust, recovering the true number of mixture components, while the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood tend to overselect as sample size grows. On a maize proteomics dataset of drought-responsive traits, our dendrogram-guided SGMoE selects two experts, exposes a clear mixing-measure hierarchy, stabilizes the likelihood early, and yields interpretable genotype-phenotype maps, outperforming standard criteria without multi-size training.
MEMay 25, 2025
A Unified Framework for Variable Selection in Model-Based Clustering with Missing Not at RandomBinh H. Ho, Long Nguyen Chi, TrungTin Nguyen et al.
Model-based clustering integrated with variable selection is a powerful tool for uncovering latent structures within complex data. However, its effectiveness is often hindered by challenges such as identifying relevant variables that define heterogeneous subgroups and handling data that are missing not at random, a prevalent issue in fields like transcriptomics. While several notable methods have been proposed to address these problems, they typically tackle each issue in isolation, thereby limiting their flexibility and adaptability. This paper introduces a unified framework designed to address these challenges simultaneously. Our approach incorporates a data-driven penalty matrix into penalized clustering to enable more flexible variable selection, along with a mechanism that explicitly models the relationship between missingness and latent class membership. We demonstrate that, under certain regularity conditions, the proposed framework achieves both asymptotic consistency and selection consistency, even in the presence of missing data. This unified strategy significantly enhances the capability and efficiency of model-based clustering, advancing methodologies for identifying informative variables that define homogeneous subgroups in the presence of complex missing data patterns. The performance of the framework, including its computational efficiency, is evaluated through simulations and demonstrated using both synthetic and real-world transcriptomic datasets.
MLDec 6, 2024
The Polynomial Stein Discrepancy for Assessing Moment ConvergenceNarayan Srinivasan, Matthew Sutton, Christopher Drovandi et al.
We propose a novel method for measuring the discrepancy between a set of samples and a desired posterior distribution for Bayesian inference. Classical methods for assessing sample quality like the effective sample size are not appropriate for scalable Bayesian sampling algorithms, such as stochastic gradient Langevin dynamics, that are asymptotically biased. Instead, the gold standard is to use the kernel Stein Discrepancy (KSD), which is itself not scalable given its quadratic cost in the number of samples. The KSD and its faster extensions also typically suffer from the curse-of-dimensionality and can require extensive tuning. To address these limitations, we develop the polynomial Stein discrepancy (PSD) and an associated goodness-of-fit test. While the new test is not fully convergence-determining, we prove that it detects differences in the first r moments in the Bernstein-von Mises limit. We empirically show that the test has higher power than its competitors in several examples, and at a lower computational cost. Finally, we demonstrate that the PSD can assist practitioners to select hyper-parameters of Bayesian sampling algorithms more efficiently than competitors.
MLMar 20, 2020
Sequential Bayesian Experimental Design for Implicit Models via Mutual InformationSteven Kleinegesse, Christopher Drovandi, Michael U. Gutmann
Bayesian experimental design (BED) is a framework that uses statistical models and decision making under uncertainty to optimise the cost and performance of a scientific experiment. Sequential BED, as opposed to static BED, considers the scenario where we can sequentially update our beliefs about the model parameters through data gathered in the experiment. A class of models of particular interest for the natural and medical sciences are implicit models, where the data generating distribution is intractable, but sampling from it is possible. Even though there has been a lot of work on static BED for implicit models in the past few years, the notoriously difficult problem of sequential BED for implicit models has barely been touched upon. We address this gap in the literature by devising a novel sequential design framework for parameter estimation that uses the Mutual Information (MI) between model parameters and simulated data as a utility function to find optimal experimental designs, which has not been done before for implicit models. Our approach uses likelihood-free inference by ratio estimation to simultaneously estimate posterior distributions and the MI. During the sequential BED procedure we utilise Bayesian optimisation to help us optimise the MI utility. We find that our framework is efficient for the various implicit models tested, yielding accurate parameter estimates after only a few iterations.
COSep 11, 2019
Efficient Bayesian synthetic likelihood with whitening transformationsJacob W. Priddle, Scott A. Sisson, David T. Frazier et al.
Likelihood-free methods are an established approach for performing approximate Bayesian inference for models with intractable likelihood functions. However, they can be computationally demanding. Bayesian synthetic likelihood (BSL) is a popular such method that approximates the likelihood function of the summary statistic with a known, tractable distribution -- typically Gaussian -- and then performs statistical inference using standard likelihood-based techniques. However, as the number of summary statistics grows, the number of model simulations required to accurately estimate the covariance matrix for this likelihood rapidly increases. This poses significant challenge for the application of BSL, especially in cases where model simulation is expensive. In this article we propose whitening BSL (wBSL) -- an efficient BSL method that uses approximate whitening transformations to decorrelate the summary statistics at each algorithm iteration. We show empirically that this can reduce the number of model simulations required to implement BSL by more than an order of magnitude, without much loss of accuracy. We explore a range of whitening procedures and demonstrate the performance of wBSL on a range of simulated and real modelling scenarios from ecology and biology.
COJul 25, 2019
BSL: An R Package for Efficient Parameter Estimation for Simulation-Based Models via Bayesian Synthetic LikelihoodZiwen An, Leah F South, Christopher Drovandi
Bayesian synthetic likelihood (BSL) is a popular method for estimating the parameter posterior distribution for complex statistical models and stochastic processes that possess a computationally intractable likelihood function. Instead of evaluating the likelihood, BSL approximates the likelihood of a judiciously chosen summary statistic of the data via model simulation and density estimation. Compared to alternative methods such as approximate Bayesian computation (ABC), BSL requires little tuning and requires less model simulations than ABC when the chosen summary statistic is high-dimensional. The original synthetic likelihood relies on a multivariate normal approximation of the intractable likelihood, where the mean and covariance are estimated by simulation. An extension of BSL considers replacing the sample covariance with a penalised covariance estimator to reduce the number of required model simulations. Further, a semi-parametric approach has been developed to relax the normality assumption. In this paper, we present an R package called BSL that amalgamates the aforementioned methods and more into a single, easy-to-use and coherent piece of software. The R package also includes several examples to illustrate how to use the package and demonstrate the utility of the methods.
COFeb 25, 2019
Vector operations for accelerating expensive Bayesian computations -- a tutorial guideDavid J. Warne, Scott A. Sisson, Christopher Drovandi
Many applications in Bayesian statistics are extremely computationally intensive. However, they are often inherently parallel, making them prime targets for modern massively parallel processors. Multi-core and distributed computing is widely applied in the Bayesian community, however, very little attention has been given to fine-grain parallelisation using single instruction multiple data (SIMD) operations that are available on most modern commodity CPUs and is the basis of GPGPU computing. In this work, we practically demonstrate, using standard programming libraries, the utility of the SIMD approach for several topical Bayesian applications. We show that SIMD can improve the floating point arithmetic performance resulting in up to $6\times$ improvement in serial algorithm performance. Importantly, these improvements are multiplicative to any gains achieved through multi-core processing. We illustrate the potential of SIMD for accelerating Bayesian computations and provide the reader with techniques for exploiting modern massively parallel processing environments using standard tools.