ML · LG · Nov 17, 2021

Sampling To Improve Predictions For Underrepresented Observations In Imbalanced Data

arXiv:2111.09065v3 · 4 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses model fairness and performance for underrepresented groups in imbalanced biopharmaceutical manufacturing data, but it is incremental as it applies existing sampling methods to a new domain.

The authors address data imbalance in production settings, which harms predictive performance on underrepresented observations. They test three sampling approaches on a simulated penicillin production dataset and find that training on sampled data slightly reduces overall performance but systematically improves predictions for underrepresented observations.

Data imbalance is common in production data, where controlled production settings require data to fall within a narrow range of variation and data are collected with quality assessment in mind, rather than data analytic insights. This imbalance negatively impacts the predictive performance of models on underrepresented observations. We propose sampling to adjust for this imbalance with the goal of improving the performance of models trained on historical production data. We investigate the use of three sampling approaches to adjust for imbalance. The goal is to downsample the covariates in the training data and subsequently fit a regression model. We investigate how the predictive power of the model changes when using either the sampled or the original data for training. We apply our methods on a large biopharmaceutical manufacturing data set from an advanced simulation of penicillin production and find that fitting a model using the sampled data gives a small reduction in the overall predictive performance, but yields a systematically better performance on underrepresented observations. In addition, the results emphasize the need for alternative, fair, and balanced model evaluations.
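The workflow described in the abstract (downsample the training covariates, fit a regression model, then compare performance overall and on underrepresented observations) can be sketched roughly as follows. The abstract does not name the three sampling approaches, so this sketch substitutes a simple histogram-capped downsampler as a stand-in. The function names (`downsample_by_bins`, `fit_linear`, `rmse`), the synthetic data, and the "tail" threshold are all hypothetical choices for illustration, not the authors' method or data.

```python
import numpy as np


def downsample_by_bins(X, n_bins=10, cap=None, rng=None):
    """Return sorted row indices keeping at most `cap` rows per bin of X[:, 0].

    Stand-in sampler (an assumption, not one of the paper's three approaches).
    """
    rng = np.random.default_rng(rng)
    x = X[:, 0]
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Using only the interior edges yields bin ids in 0..n_bins-1.
    bins = np.digitize(x, edges[1:-1])
    counts = np.bincount(bins, minlength=n_bins)
    if cap is None:
        # Default cap: median size of the non-empty bins.
        cap = int(np.median(counts[counts > 0]))
    keep = []
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        if idx.size > cap:
            idx = rng.choice(idx, size=cap, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))


def fit_linear(X, y):
    """Ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def rmse(coef, X, y):
    """Root mean squared error of the linear model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic imbalanced covariate: most rows sit in a narrow range.
    x = np.concatenate([rng.normal(0.0, 0.3, 1900), rng.uniform(2.0, 4.0, 100)])
    X = x[:, None]
    y = x**2 + rng.normal(0.0, 0.1, x.size)  # hypothetical response

    idx = downsample_by_bins(X, n_bins=10, rng=1)
    coef_full = fit_linear(X, y)
    coef_sampled = fit_linear(X[idx], y[idx])

    tail = x > 2.0  # hypothetical definition of "underrepresented"
    for name, coef in [("original", coef_full), ("sampled", coef_sampled)]:
        print(name, "overall RMSE:", rmse(coef, X, y),
              "tail RMSE:", rmse(coef, X[tail], y[tail]))
```

Reporting the error separately for the sparse region, rather than only one aggregate score, mirrors the abstract's closing point that balanced model evaluations are needed. A real study would evaluate on held-out data rather than on the training set as done here.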
