Striving for data-model efficiency: Identifying data externalities on group performance
This work addresses data-model inefficiencies with the aim of making machine learning systems more trustworthy and effective, though its contribution to characterizing and detecting these issues is incremental.
The paper tackles the problem of negative data externalities, where adding training data from certain sources can lower predictive performance on key sub-groups. It finds that these externalities can arise in standard learning settings, that they depend on conditions such as training set size and model size, and that they imply a lower bound on feasible model improvements.
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance. In this work, we seek to better understand how we might characterize, detect, and design for data-model synergies. We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population, a phenomenon we refer to as negative data externalities on group performance. Such externalities can arise in standard learning settings and can manifest differently depending on the relationship between training set size and model size. Data externalities directly imply a lower bound on feasible model improvements, yet improving models efficiently requires understanding the underlying data-model tensions. From a broader perspective, our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
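
To make the notion of a negative data externality on group performance concrete, the following is a minimal toy sketch, not the paper's experimental setup. It assumes a deliberately misspecified model (a single least-squares line shared by two groups whose input-output relationships differ) and synthetic data. The helper names `fit_linear`, `mse`, `data_externality`, and `sample` are hypothetical and chosen only for illustration. The sketch trains once on group A alone and once on group A plus an extra source B, then reports the change in test loss on group A.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the weight vector."""
    Xb = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def mse(w, X, y):
    """Mean squared error of the linear model w on (X, y)."""
    Xb = np.column_stack([X, np.ones(len(X))])
    return float(np.mean((Xb @ w - y) ** 2))

def data_externality(base, extra, eval_set):
    """Change in loss on eval_set when `extra` is added to `base`.

    A positive value means the loss went up, i.e. the added source
    imposed a negative externality on the evaluated group.
    """
    X_base, y_base = base
    X_extra, y_extra = extra
    w_without = fit_linear(X_base, y_base)
    w_with = fit_linear(np.vstack([X_base, X_extra]),
                        np.concatenate([y_base, y_extra]))
    return mse(w_with, *eval_set) - mse(w_without, *eval_set)

rng = np.random.default_rng(0)

def sample(slope, n):
    """Synthetic group: y = slope * x + small Gaussian noise (assumed setup)."""
    X = rng.normal(size=(n, 1))
    y = slope * X[:, 0] + 0.1 * rng.normal(size=n)
    return X, y

train_a = sample(1.0, 200)    # group A: y is roughly +x
train_b = sample(-1.0, 200)   # source B: y is roughly -x
test_a = sample(1.0, 1000)    # held-out data from group A

delta = data_externality(train_a, train_b, test_a)
print(f"change in group-A test MSE after adding source B: {delta:+.3f}")
```

In this toy setting the pooled model cannot fit both groups at once, so adding source B raises the loss on group A. Whether and how strongly such an effect appears in practice depends on the model class and on the sizes of the training sets involved, which is the dependence the abstract points to.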