ML · LG · Oct 24, 2022

Post-Selection Confidence Bounds for Prediction Performance

arXiv:2210.13206v3 · 2 citations · h-index: 39
Originality: Incremental advance
AI Analysis

The paper addresses the need for robust post-selection inference in machine learning and offers a method for model evaluation that applies to any combination of models, selection strategies, and weighted performance measures. Its contribution to computing confidence bounds is an incremental improvement.

The paper tackles the problem of providing valid confidence bounds for multiple models selected based on prediction performance, proposing an algorithm that uses bootstrap tilting and multiplicity correction to compute lower confidence bounds. The results show that the approach reliably achieves nominal coverage probability and yields better-performing models, especially with small sample sizes.

In machine learning, selecting a promising model from a potentially large number of competing models and assessing its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated: the sample at hand is split into training, validation, and evaluation sets, and only a single confidence interval is computed for the prediction performance of the final selected model. We instead propose an algorithm for computing valid lower confidence bounds for multiple models that have been selected based on their prediction performance in the evaluation set, by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable for any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments, which show that our proposed approach yields lower confidence bounds that are at least comparable to bounds from standard approaches and that reliably reach the nominal coverage probability. In addition, especially when the sample size is small, our proposed approach yields better-performing prediction models than the default of selecting only one model for evaluation.
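To make the idea of a maxT-type correction concrete, the following is a minimal sketch, not the paper's algorithm. It swaps in an ordinary nonparametric bootstrap for bootstrap tilting, uses accuracy as the performance measure, and keeps the standard errors fixed at their original-sample values. The function names `weighted_accuracy` and `maxt_lower_bounds` and the toy data are assumptions made for illustration. Each resample is stored as per-observation counts, which is one way to see why a performance measure that accepts weights is convenient.

```python
import math
import random


def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy in which observation i counts weights[i] times."""
    hit = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return hit / sum(weights)


def maxt_lower_bounds(y_true, preds, alpha=0.05, n_boot=2000, seed=0):
    """Simultaneous lower bounds for the accuracy of every model in preds.

    Illustrative only: plain bootstrap, not bootstrap tilting.
    """
    rng = random.Random(seed)
    n = len(y_true)
    ones = [1] * n
    est = [weighted_accuracy(y_true, p, ones) for p in preds]
    # Binomial standard errors; the floor avoids division by zero.
    se = [max(math.sqrt(e * (1 - e) / n), 1e-12) for e in est]

    max_stats = []
    for _ in range(n_boot):
        # One bootstrap resample, stored as per-observation counts (weights).
        counts = [0] * n
        for _ in range(n):
            counts[rng.randrange(n)] += 1
        # The same resample is used for all models, so their correlation
        # is reflected in the distribution of the maximum statistic.
        max_stats.append(max(
            (weighted_accuracy(y_true, p, counts) - e) / s
            for p, e, s in zip(preds, est, se)
        ))

    # Empirical (1 - alpha) quantile of the maximum standardized deviation.
    max_stats.sort()
    k = min(n_boot - 1, math.ceil((1 - alpha) * n_boot) - 1)
    crit = max_stats[k]
    return est, [e - crit * s for e, s in zip(est, se)]


# Toy evaluation set: two hypothetical models with 80% and 75% accuracy.
y_true = [i % 2 for i in range(200)]
model_a = [1 - t if i % 5 == 0 else t for i, t in enumerate(y_true)]
model_b = [1 - t if i % 4 == 0 else t for i, t in enumerate(y_true)]
estimates, lower = maxt_lower_bounds(y_true, [model_a, model_b], n_boot=500)
print(estimates, lower)
```

Because the critical value comes from the maximum over all candidate models, the resulting bounds are meant to hold jointly, whichever of the candidates is reported afterwards. In this sketch, that is the sense in which selection is treated as a simultaneous inference problem.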

Code Implementations: 2 repos