How well does your sampler really work?
This work addresses the need for better methods of evaluating MCMC samplers in machine learning and statistics. It offers a more systematic approach than hand-crafted examples, though its improvement to benchmarking practice is incremental.
The authors tackle the problem of evaluating MCMC samplers by building a data-driven benchmark system that generates test examples from real data sets and models. The benchmark reports concrete metrics: effective sample size per unit time and estimation efficiency per sample.
We present a new data-driven benchmark system, inspired by the COCO benchmark in optimization, to evaluate the performance of new MCMC samplers. We view this task as critically important to machine learning and statistics given the rate at which new samplers are proposed. The common hand-crafted examples used to test new samplers are unsatisfactory; we take a meta-learning-like approach to generate benchmark examples from a large corpus of data sets and models. Surrogates of posteriors found in real problems are created using highly flexible density models, including modern neural-network-based approaches. We provide new insights into the real effective sample size of various samplers per unit time and the estimation efficiency of the samplers per sample. Additionally, we provide a meta-analysis to assess the predictive utility of various MCMC diagnostics and perform a nonparametric regression to combine them.
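To make the notion of effective sample size (ESS) concrete, the following is a minimal sketch of one common autocorrelation-based estimator for a single scalar chain. It is not the estimator used in the benchmark, which the abstract does not specify. The function names (`autocorrelation`, `effective_sample_size`, `ess_per_second`) and the rule of truncating the sum at the first non-positive autocorrelation are assumptions made for illustration.

```python
import numpy as np


def autocorrelation(chain):
    """Normalized autocorrelation of a 1-D chain at lags 0..n-1 (FFT-based)."""
    x = np.asarray(chain, dtype=float)
    n = x.size
    centered = x - x.mean()
    # Zero-pad to length 2n so the circular correlation does not wrap around.
    spectrum = np.fft.rfft(centered, n=2 * n)
    acov = np.fft.irfft(spectrum * np.conj(spectrum), n=2 * n)[:n] / n
    if acov[0] <= 0.0:
        raise ValueError("chain has zero variance; ESS is undefined")
    return acov / acov[0]


def effective_sample_size(chain):
    """ESS = n / tau, with tau = 1 + 2 * sum of leading positive autocorrelations.

    Assumption for this sketch: the sum is truncated at the first
    non-positive autocorrelation, a simple heuristic rather than the
    only (or best) truncation rule.
    """
    rho = autocorrelation(chain)
    n = rho.size
    tau = 1.0
    for lag in range(1, n):
        if rho[lag] <= 0.0:
            break
        tau += 2.0 * rho[lag]
    return n / tau


def ess_per_second(chain, wall_clock_seconds):
    """ESS per unit time, given the (hypothetical) measured sampling time."""
    return effective_sample_size(chain) / wall_clock_seconds
```

For an AR(1) chain with coefficient φ, the integrated autocorrelation time is (1 + φ)/(1 − φ), so a chain with φ = 0.9 should yield an ESS of roughly n/19. Dividing the ESS by the wall-clock time spent sampling gives the per-unit-time figure discussed in the abstract.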