IMAug 30, 2019Code
Conditional Density Estimation Tools in Python and R with Applications to Photometric Redshifts and Likelihood-Free Cosmological InferenceNiccolò Dalmasso, Taylor Pospisil, Ann B. Lee et al.
It is well known in astronomy that propagating non-Gaussian prediction uncertainty in photometric redshift estimates is key to reducing bias in downstream cosmological analyses. Similarly, likelihood-free inference approaches, which are beginning to emerge as a tool for cosmological analysis, require a characterization of the full uncertainty landscape of the parameters of interest given observed data. However, most machine learning (ML) or training-based methods with open-source software target point prediction or classification, and hence fall short in quantifying uncertainty in complex regression and parameter inference settings. As an alternative to methods that focus on predicting the response (or parameters) $\mathbf{y}$ from features $\mathbf{x}$, we provide nonparametric conditional density estimation (CDE) tools for approximating and validating the entire probability density function (PDF) $\mathrm{p}(\mathbf{y}|\mathbf{x})$ of $\mathbf{y}$ given (i.e., conditional on) $\mathbf{x}$. As there is no one-size-fits-all CDE method, the goal of this work is to provide a comprehensive range of statistical tools and open-source software for nonparametric CDE and method assessment which can accommodate different types of settings and be easily fit to the problem at hand. Specifically, we introduce four CDE software packages in $\texttt{Python}$ and $\texttt{R}$ based on ML prediction methods adapted and optimized for CDE: $\texttt{NNKCDE}$, $\texttt{RFCDE}$, $\texttt{FlexCode}$, and $\texttt{DeepCDE}$. Furthermore, we present the $\texttt{cdetools}$ package, which includes functions for computing a CDE loss function for tuning and assessing the quality of individual PDFs, along with diagnostic functions. We provide sample code in $\texttt{Python}$ and $\texttt{R}$ as well as examples of applications to photometric redshift estimation and likelihood-free cosmological inference via CDE.
MLApr 16, 2018Code
RFCDE: Random Forests for Conditional Density EstimationTaylor Pospisil, Ann B. Lee
Random forests is a common non-parametric regression technique which performs well for mixed-type data and irrelevant covariates, while being robust to monotonic variable transformations. Existing random forest implementations target regression or classification. We introduce the RFCDE package for fitting random forest models optimized for nonparametric conditional density estimation, including joint densities for multiple responses. This enables analysis of conditional probability distributions which is useful for propagating uncertainty and of joint distributions that describe relationships between multiple responses and covariates. RFCDE is released under the MIT open-source license and can be accessed at https://github.com/tpospisi/rfcde . Both R and Python versions, which call a common C++ library, are available.
MEMay 27, 2019
Validation of Approximate Likelihood and Emulator Models for Computationally Intensive SimulationsNiccolò Dalmasso, Ann B. Lee, Rafael Izbicki et al.
Complex phenomena in engineering and the sciences are often modeled with computationally intensive feed-forward simulations for which a tractable analytic likelihood does not exist. In these cases, it is sometimes necessary to estimate an approximate likelihood or fit a fast emulator model for efficient statistical inference; such surrogate models include Gaussian synthetic likelihoods and more recently neural density estimators such as autoregressive models and normalizing flows. To date, however, there is no consistent way of quantifying the quality of such a fit. Here we propose a statistical framework that can distinguish any arbitrary misspecified model from the target likelihood, and that in addition can identify with statistical confidence the regions of parameter as well as feature space where the fit is inadequate. Our validation method applies to settings where simulations are extremely costly and generated in batches or "ensembles" at fixed locations in parameter space. At the heart of our approach is a two-sample test that quantifies the quality of the fit at fixed parameter values, and a global test that assesses goodness-of-fit across simulation parameters. While our general framework can incorporate any test statistic or distance metric, we specifically argue for a new two-sample test that can leverage any regression method to attain high power and provide diagnostics in complex data settings.
MEMay 14, 2018
ABC-CDE: Towards Approximate Bayesian Computation with Complex High-Dimensional Data and Limited SimulationsRafael Izbicki, Ann B. Lee, Taylor Pospisil
Approximate Bayesian Computation (ABC) is typically used when the likelihood is either unavailable or intractable but where data can be simulated under different parameter settings using a forward model. Despite the recent interest in ABC, high-dimensional data and costly simulations still remain a bottleneck in some applications. There is also no consensus as to how to best assess the performance of such methods without knowing the true posterior. We show how a nonparametric conditional density estimation (CDE) framework, which we refer to as ABC-CDE, help address three nontrivial challenges in ABC: (i) how to efficiently estimate the posterior distribution with limited simulations and different types of data, (ii) how to tune and compare the performance of ABC and related methods in estimating the posterior itself, rather than just certain properties of the density, and (iii) how to efficiently choose among a large set of summary statistics based on a CDE surrogate loss. We provide theoretical and empirical evidence that justify ABC-CDE procedures that {\em directly} estimate and assess the posterior based on an initial ABC sample, and we describe settings where standard ABC and regression-based approaches are inadequate.