Diversify and Disambiguate: Learning From Underspecified Data
This paper addresses underspecification in machine learning datasets, which can cause poor out-of-distribution performance, and offers a method to improve robustness.
The paper tackles underspecified datasets, for which multiple solutions fit the training data equally well. It proposes DivDis, which first learns a diverse set of hypotheses and then selects one using minimal extra supervision. The selected hypotheses use robust features in image classification and NLP tasks.
Many datasets are underspecified: there exist multiple equally viable solutions to a given task. Underspecification can be problematic for methods that learn a single hypothesis because different functions that achieve low training loss can focus on different predictive features and thus produce widely varying predictions on out-of-distribution data. We propose DivDis, a simple two-stage framework that first learns a diverse collection of hypotheses for a task by leveraging unlabeled data from the test distribution. We then disambiguate by selecting one of the discovered hypotheses using minimal additional supervision, in the form of additional labels or inspection of function visualization. We demonstrate the ability of DivDis to find hypotheses that use robust features in image classification and natural language processing problems with underspecification.
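The two stages described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not state how diversity is measured, so the mutual-information score below is an assumption chosen for illustration, and the function names `pairwise_mutual_information` and `disambiguate` are hypothetical. The sketch assumes each hypothesis is a classification head that outputs class probabilities, and that disambiguation uses a handful of labeled examples from the test distribution.

```python
import numpy as np


def pairwise_mutual_information(p1, p2, eps=1e-12):
    """Mutual information between two heads' predictions on unlabeled data.

    p1, p2: arrays of shape (n_examples, n_classes) holding each head's
    predicted class probabilities. A low value means the heads disagree
    in a statistically independent way, i.e. they are diverse.
    NOTE: using mutual information here is an illustrative assumption.
    """
    # Joint distribution over (head-1 class, head-2 class), averaged over examples.
    joint = p1.T @ p2 / p1.shape[0]
    m1 = joint.sum(axis=1, keepdims=True)  # marginal of head 1, shape (C, 1)
    m2 = joint.sum(axis=0, keepdims=True)  # marginal of head 2, shape (1, C)
    return float(np.sum(joint * (np.log(joint + eps) - np.log(m1 @ m2 + eps))))


def disambiguate(head_probs, labels):
    """Pick the head that is most accurate on a few labeled target examples.

    head_probs: array of shape (n_heads, n_examples, n_classes).
    labels: integer array of shape (n_examples,).
    Returns (index of the selected head, per-head accuracies).
    """
    preds = head_probs.argmax(axis=-1)                # (n_heads, n_examples)
    acc = (preds == labels[None, :]).mean(axis=1)     # (n_heads,)
    return int(acc.argmax()), acc


# Toy data: two heads that agree on nothing systematic (independent predictions).
head_a = np.eye(2)[[0, 0, 1, 1]]  # one-hot predictions of head A
head_b = np.eye(2)[[0, 1, 0, 1]]  # one-hot predictions of head B
mi_diverse = pairwise_mutual_information(head_a, head_b)
mi_identical = pairwise_mutual_information(head_a, head_a)

# Four labeled target examples are enough here to tell the heads apart.
target_labels = np.array([0, 1, 0, 1])
best_head, accuracies = disambiguate(np.stack([head_a, head_b]), target_labels)
```

In a training loop, a score like `pairwise_mutual_information` would be minimized on unlabeled test-distribution inputs alongside the usual training loss, so that the heads fit the labeled data but differ elsewhere. The disambiguation step then needs only enough labels to separate the heads, which is why a small labeled set can suffice.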