ML LG | May 17, 2019

Merging versus Ensembling in Multi-Study Prediction: Theoretical Insight from Random Effects

Zoe Guan, Giovanni Parmigiani, Prasad Patil

arXiv:1905.07382v44.93 citations | Has Code

Originality: Incremental advance

AI Analysis

The paper gives researchers in fields such as metagenomics theoretical guidance on when to combine studies and when to keep them separate in multi-study prediction. The advance is incremental because the analysis builds on existing ridge regression methods.

The paper addresses whether to merge datasets or use multi-study ensembling for prediction when predictor-outcome relationships vary across studies. It shows analytically and via simulation that, for ridge regression, merging yields lower prediction error when those relationships are relatively homogeneous, while ensembling outperforms merging beyond a transition point as heterogeneity increases.

A critical decision point when training predictors using multiple studies is whether studies should be combined or treated separately. We compare two multi-study prediction approaches in the presence of potential heterogeneity in predictor-outcome relationships across datasets: 1) merging all of the datasets and training a single learner, and 2) multi-study ensembling, which involves training a separate learner on each dataset and combining the predictions resulting from each learner. For ridge regression, we show analytically and confirm via simulation that merging yields lower prediction error than ensembling when the predictor-outcome relationships are relatively homogeneous across studies. However, as cross-study heterogeneity increases, there exists a transition point beyond which ensembling outperforms merging. We provide analytic expressions for the transition point in various scenarios, study asymptotic properties, and illustrate how transition point theory can be used for deciding when studies should be combined with an application from metagenomics.
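The contrast between the two approaches can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code. It assumes a random-effects setup in which each study's coefficients equal a shared vector plus Gaussian noise with variance `sigma2`, an equally weighted ensemble, closed-form ridge regression without an intercept, and unequal study sizes. All function names, sample sizes, and parameter values are hypothetical choices for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients without intercept: (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def simulate_study(rng, n, beta, sigma2):
    """Draw one study whose coefficients are beta plus a study-specific random effect."""
    p = beta.shape[0]
    beta_k = beta + rng.normal(0.0, np.sqrt(sigma2), size=p)  # heterogeneity
    X = rng.normal(size=(n, p))
    y = X @ beta_k + rng.normal(size=n)  # unit-variance residual noise
    return X, y


def compare(sigma2, sizes=(40, 40, 40, 400), p=5, lam=1.0,
            n_test=200, n_reps=300, seed=0):
    """Average test MSE of merging and of equal-weight ensembling.

    The study sizes, p, and lam are illustrative assumptions, not values
    taken from the paper.
    """
    rng = np.random.default_rng(seed)
    beta = np.ones(p)
    mse_merge, mse_ens = 0.0, 0.0
    for _ in range(n_reps):
        studies = [simulate_study(rng, n, beta, sigma2) for n in sizes]
        # The test set is a new study with its own random effect.
        X_test, y_test = simulate_study(rng, n_test, beta, sigma2)

        # Merging: stack all training studies and fit a single learner.
        X_all = np.vstack([X for X, _ in studies])
        y_all = np.concatenate([y for _, y in studies])
        pred_merge = X_test @ ridge_fit(X_all, y_all, lam)

        # Ensembling: one learner per study, predictions averaged equally.
        preds = [X_test @ ridge_fit(X, y, lam) for X, y in studies]
        pred_ens = np.mean(preds, axis=0)

        mse_merge += np.mean((y_test - pred_merge) ** 2) / n_reps
        mse_ens += np.mean((y_test - pred_ens) ** 2) / n_reps
    return mse_merge, mse_ens


if __name__ == "__main__":
    for sigma2 in (0.0, 0.05, 1.0):
        m, e = compare(sigma2)
        print(f"sigma2={sigma2}: merged MSE={m:.3f}, ensemble MSE={e:.3f}")
```

Under these assumptions, merging should have the lower average error when `sigma2` is zero and ensembling the lower error when `sigma2` is large. Scanning intermediate values of `sigma2` gives a rough empirical view of where the two curves cross in this particular setup. That crossing depends on the assumed sizes and weights and is not the paper's analytic transition point.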

