Tyler McCormick

ML
h-index3
3papers
3citations
Novelty43%
AI Score34

3 Papers

61.7MLMar 11
Spatially Robust Inference with Predicted and Missing at Random Labels

Stephen Salerno, Zhenke Wu, Tyler McCormick

When outcome data are expensive or onerous to collect, scientists increasingly substitute predictions from machine learning and AI models for unlabeled cases, a process which has consequences for downstream statistical inference. While recent methods provide valid uncertainty quantification under independent sampling, real-world applications involve missing at random (MAR) labeling and spatial dependence. For inference in this setting, we propose a doubly robust estimator with cross-fit nuisances. We show that cross-fitting induces fold-level correlation that distorts spatial variance estimators, producing unstable or overly conservative confidence intervals. To address this, we propose a jackknife spatial heteroscedasticity and autocorrelation consistent (HAC) variance correction that separates spatial dependence from fold-induced noise. Under standard identification and dependence conditions, the resulting intervals are asymptotically valid. Simulations and benchmark datasets show substantial improvement in finite-sample calibration, particularly under MAR labeling and clustered sampling.

MLMar 9, 2025
Unique Rashomon Sets for Robust Active Learning

Simon Nguyen, Kentaro Hoffman, Tyler McCormick

Collecting labeled data for machine learning models is often expensive and time-consuming. Active learning addresses this challenge by selectively labeling the most informative observations, but when initial labeled data is limited, it becomes difficult to distinguish genuinely informative points from those appearing uncertain primarily due to noise. Ensemble methods like random forests are a powerful approach to quantifying this uncertainty but do so by aggregating all models indiscriminately. This includes poor performing models and redundant models, a problem that worsens in the presence of noisy data. We introduce UNique Rashomon Ensembled Active Learning (UNREAL), which selectively ensembles only distinct models from the Rashomon set, which is the set of nearly optimal models. Restricting ensemble membership to high-performing models with different explanations helps distinguish genuine uncertainty from noise-induced variation. We show that UNREAL achieves faster theoretical convergence rates than traditional active learning approaches and demonstrates empirical improvements of up to 20% in predictive accuracy across five benchmark datasets, while simultaneously enhancing model interpretability.

MLMay 26, 2021
The "given data" paradigm undermines both cultures

Tyler McCormick

Breiman organizes "Statistical modeling: The two cultures" around a simple visual. Data, to the far right, are compelled into a "black box" with an arrow and then catapulted left by a second arrow, having been transformed into an output. Breiman then posits two interpretations of this visual as encapsulating a distinction between two cultures in statistics. The divide, he argues is about what happens in the "black box." In this comment, I argue for a broader perspective on statistics and, in doing so, elevate questions from "before" and "after" the box as fruitful areas for statistical innovation and practice.