LG AI ME

Jan 31, 2022

Adaptive Sampling Strategies to Construct Equitable Training Datasets

William Cai, Ro Encarnacion, Bobbie Chern, Sam Corbett-Davies, Miranda Bogen, Stevie Bergman, Sharad Goel

arXiv:2202.01327v1

13.633 citations

Originality: Incremental advance

AI Analysis

This paper addresses fairness issues in ML for underserved groups, but the advance is incremental because it builds on existing optimization and sampling methods.

The paper tackles performance disparities in machine learning models that stem from unrepresentative training data. It formalizes equitable dataset creation as a constrained optimization problem. In simulations on synthetic genomic data, the adaptive sampling strategy outperforms common heuristics such as equal and proportional sampling.

In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem. We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups. We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific learning rates and costs per sample. This flexible approach incorporates preferences of model-builders and other stakeholders, as well as the statistical properties of the learning task. When data collection decisions are made sequentially, we show that under certain conditions this optimization problem can be efficiently solved even without prior knowledge of the learning rates. To illustrate our approach, we conduct a simulation study of polygenic risk scores on synthetic genomic data -- an application domain that often suffers from non-representative data collection. We find that our adaptive sampling strategy outperforms several common data collection heuristics, including equal and proportional sampling, demonstrating the value of strategic dataset design for building equitable models.
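The budget-allocation idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes the group-specific learning curves are already known and follow a hypothetical power-law form, whereas the paper's adaptive strategy estimates learning rates as data arrives. The names `power_law_error` and `greedy_allocation`, the curve parameters, the weights, and the costs are all illustrative assumptions. The sketch greedily spends a fixed budget on whichever group currently offers the largest weighted error reduction per unit cost.

```python
from typing import Callable, Dict


def power_law_error(a: float, b: float, floor: float) -> Callable[[int], float]:
    """Hypothetical learning curve (an assumption, not taken from the paper):
    error(n) = floor + a * (n + 1) ** (-b), decreasing in the sample count n."""
    return lambda n: floor + a * (n + 1) ** (-b)


def greedy_allocation(
    curves: Dict[str, Callable[[int], float]],
    costs: Dict[str, float],
    weights: Dict[str, float],
    budget: float,
    batch: int = 10,
) -> Dict[str, int]:
    """Spend `budget` in batches, each time choosing the group with the
    largest weighted error reduction per unit cost. A heuristic sketch for
    minimizing the weighted sum of group errors under a budget constraint."""
    counts = {group: 0 for group in curves}
    remaining = budget
    while True:
        best_group, best_gain = None, 0.0
        for group, curve in curves.items():
            batch_cost = costs[group] * batch
            if batch_cost > remaining:
                continue  # this group's next batch is unaffordable
            reduction = curve(counts[group]) - curve(counts[group] + batch)
            gain = weights[group] * reduction / batch_cost
            if gain > best_gain:
                best_group, best_gain = group, gain
        if best_group is None:
            return counts  # nothing affordable or nothing left to gain
        counts[best_group] += batch
        remaining -= costs[best_group] * batch


# Illustrative inputs: group "B" learns more slowly and costs more per sample.
curves = {
    "A": power_law_error(a=1.0, b=0.5, floor=0.05),
    "B": power_law_error(a=1.0, b=0.3, floor=0.05),
}
costs = {"A": 1.0, "B": 2.0}
weights = {"A": 1.0, "B": 1.0}
allocation = greedy_allocation(curves, costs, weights, budget=1000.0)
print(allocation)
```

Changing `weights` is one way to encode stakeholder preferences in this sketch; for example, raising the weight of the slower-learning group shifts more of the budget toward it. Equal and proportional sampling correspond to fixing the counts in advance rather than responding to the learning curves.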
