LGOct 31, 2022
Cost-aware Generalized $α$-investing for Multiple Hypothesis TestingThomas Cook, Harsh Vardhan Dubey, Ji Ah Lee et al.
We consider the problem of sequential multiple hypothesis testing with nontrivial data collection costs. This problem appears, for example, when conducting biological experiments to identify differentially expressed genes of a disease process. This work builds on the generalized $α$-investing framework which enables control of the false discovery rate in a sequential testing setting. We make a theoretical analysis of the long term asymptotic behavior of $α$-wealth which motivates a consideration of sample size in the $α$-investing decision rule. Posing the testing process as a game with nature, we construct a decision rule that optimizes the expected $α$-wealth reward (ERO) and provides an optimal sample size for each test. Empirical results show that a cost-aware ERO decision rule correctly rejects more false null hypotheses than other methods for $n=1$ where $n$ is the sample size. When the sample size is not fixed cost-aware ERO uses a prior on the null hypothesis to adaptively allocate of the sample budget to each test. We extend cost-aware ERO investing to finite-horizon testing which enables the decision rule to allocate samples in a non-myopic manner. Finally, empirical tests on real data sets from biological experiments show that cost-aware ERO balances the allocation of samples to an individual test against the allocation of samples across multiple tests.
MLOct 24, 2024
Maximum a Posteriori Inference for Factor Graphs via Benders' DecompositionHarsh Vardhan Dubey, Ji Ah Lee, Patrick Flaherty
Many Bayesian statistical inference problems come down to computing a maximum a-posteriori (MAP) assignment of latent variables. Yet, standard methods for estimating the MAP assignment do not have a finite time guarantee that the algorithm has converged to a fixed point. Previous research has found that MAP inference can be represented in dual form as a linear programming problem with a non-polynomial number of constraints. A Lagrangian relaxation of the dual yields a statistical inference algorithm as a linear programming problem. However, the decision as to which constraints to remove in the relaxation is often heuristic. We present a method for maximum a-posteriori inference in general Bayesian factor models that sequentially adds constraints to the fully relaxed dual problem using Benders' decomposition. Our method enables the incorporation of expressive integer and logical constraints in clustering problems such as must-link, cannot-link, and a minimum number of whole samples allocated to each cluster. Using this approach, we derive MAP estimation algorithms for the Bayesian Gaussian mixture model and latent Dirichlet allocation. Empirical results show that our method produces a higher optimal posterior value compared to Gibbs sampling and variational Bayes methods for standard data sets and provides certificate of convergence.
MLNov 8, 2019
MAP Clustering under the Gaussian Mixture Model via Mixed Integer Nonlinear OptimizationPatrick Flaherty, Pitchaya Wiratchotisatian, Ji Ah Lee et al.
We present a global optimization approach for solving the maximum a-posteriori (MAP) clustering problem under the Gaussian mixture model.Our approach can accommodate side constraints and it preserves the combinatorial structure of the MAP clustering problem by formulating it asa mixed-integer nonlinear optimization problem (MINLP). We approximate the MINLP through a mixed-integer quadratic program (MIQP) transformation that improves computational aspects while guaranteeing $ε$-global optimality. An important benefit of our approach is the explicit quantification of the degree of suboptimality, via the optimality gap, en route to finding the globally optimal MAP clustering. Numerical experiments comparing our method to other approaches show that our method finds a better solution than standard clustering methods. Finally, we cluster a real breast cancer gene expression data set incorporating intrinsic subtype information; the induced constraints substantially improve the computational performance and produce more coherent and bio-logically meaningful clusters.