ML · CO · LG · Jun 21, 2021

Stratified Learning: A General-Purpose Statistical Method for Improved Learning under Covariate Shift

arXiv:2106.11211v2 · 3 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of non-representative training data, which affects researchers in fields such as cosmology. Because it builds on established causal inference techniques, the advance is incremental.

The authors tackled the problem of supervised learning under covariate shift by proposing a method that conditions on propensity scores to reduce its effects, achieving state-of-the-art results such as an AUC of 0.958 on a supernovae classification challenge and improved density estimation for galaxy redshift data.

We propose a simple, statistically principled, and theoretically justified method to improve supervised learning when the training set is not representative, a situation known as covariate shift. We build upon a well-established methodology in causal inference, and show that the effects of covariate shift can be reduced or eliminated by conditioning on propensity scores. In practice, this is achieved by fitting learners within strata constructed by partitioning the data based on the estimated propensity scores, leading to approximately balanced covariates and much-improved target prediction. We demonstrate the effectiveness of our general-purpose method on two contemporary research questions in cosmology, outperforming state-of-the-art importance weighting methods. We obtain the best reported AUC (0.958) on the updated "Supernovae photometric classification challenge", and we improve upon existing conditional density estimation of galaxy redshift from Sloan Digital Sky Survey (SDSS) data.
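The stratification procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The one-dimensional covariate, the logistic-regression propensity model, the choice of five strata, the least-squares learner, and all function names (`fit_propensity`, `assign_strata`, `fit_line`, `stratified_predict`) are assumptions made for illustration. The sketch estimates the probability that a point belongs to the labelled source set, partitions source and target data jointly into strata by that score, and fits a separate learner on the source points of each stratum.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_propensity(xs, is_source, lr=0.5, epochs=300):
    """Fit P(source | x) = sigmoid(a + b*x) by gradient descent on the log-loss.
    Logistic regression is an assumed choice of propensity model."""
    a = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, s in zip(xs, is_source):
            err = sigmoid(a + b * x) - s
            grad_a += err
            grad_b += err * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def assign_strata(scores, k):
    """Rank-based (quantile) partition of the scores into k near-equal strata."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * k // len(scores)
    return strata

def fit_line(xs, ys):
    """Least squares for y = c + d*x; reduces to the mean of y if x is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    d = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return my - d * mx, d

def stratified_predict(x_src, y_src, x_tgt, k=5):
    """Predict target outcomes with one learner per propensity-score stratum.
    k=5 is an assumed default, not a value taken from the source text."""
    xs = x_src + x_tgt
    is_source = [1] * len(x_src) + [0] * len(x_tgt)
    a, b = fit_propensity(xs, is_source)
    scores = [sigmoid(a + b * x) for x in xs]
    strata = assign_strata(scores, k)
    n_src = len(x_src)
    global_fit = fit_line(x_src, y_src)  # fallback for strata with < 2 source points
    fits = {}
    for j in range(k):
        idx = [i for i in range(n_src) if strata[i] == j]
        if len(idx) >= 2:
            fits[j] = fit_line([x_src[i] for i in idx], [y_src[i] for i in idx])
        else:
            fits[j] = global_fit
    preds = []
    for t, x in enumerate(x_tgt):
        c, d = fits[strata[n_src + t]]
        preds.append(c + d * x)
    return preds, scores, strata

# Synthetic covariate shift: the target covariates are shifted relative to the source.
rng = random.Random(0)
x_src = [rng.gauss(0.0, 1.0) for _ in range(60)]
x_tgt = [rng.gauss(1.0, 1.0) for _ in range(40)]
y_src = [x * x + rng.gauss(0.0, 0.1) for x in x_src]
preds, scores, strata = stratified_predict(x_src, y_src, x_tgt)
print(len(preds), "target predictions across", len(set(strata)), "strata")
```

Within each stratum the source and target points have similar estimated propensity scores, so their covariate distributions are approximately balanced and a learner fitted on the source points transfers more reliably to the target points in the same stratum. A real application would use multivariate covariates and a stronger learner inside each stratum.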

