Understanding collections of related datasets using dependent MMD coresets
This work addresses the need for interpretable machine learning tools that analyze collections of datasets and give insight into generalization, though its contribution appears incremental because it builds on existing MMD coreset methods.
The paper tackles the problem of comparing multiple related datasets in order to identify under-represented sub-populations and assess model generalization. To do so, it introduces dependent MMD coresets, a data summarization method that facilitates comparison of distributions.
Understanding how two datasets differ can help us determine whether one dataset under-represents certain sub-populations, and provides insights into how well models will generalize across datasets. Representative points selected by a maximum mean discrepancy (MMD) coreset can provide interpretable summaries of a single dataset, but are not easily compared across datasets. In this paper, we introduce dependent MMD coresets, a data summarization method for collections of datasets that facilitates comparison of distributions. We show that dependent MMD coresets are useful both for understanding multiple related datasets and for understanding how models generalize between such datasets.
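
To make the notion of an MMD coreset concrete, the following is a minimal sketch of the standard single-dataset case only: a biased estimate of the squared MMD under a Gaussian (RBF) kernel, and a greedy selection of uniformly weighted representative points. It is not the dependent MMD coreset method introduced in this paper. The function names (rbf_kernel, mmd2, greedy_mmd_coreset), the bandwidth parameter gamma, the greedy strategy, and the synthetic two-cluster data are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise squared Euclidean distances, mapped through a Gaussian kernel.
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(x, y, gamma=1.0):
    # Biased (V-statistic) estimate of the squared MMD between samples x and y.
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

def greedy_mmd_coreset(x, m, gamma=1.0):
    # Assumed strategy: repeatedly add the point whose inclusion gives the
    # smallest MMD^2 between the full dataset and the uniformly weighted subset.
    chosen = []
    for _ in range(m):
        best_i, best_val = None, np.inf
        for i in range(len(x)):
            if i in chosen:
                continue
            val = mmd2(x, x[chosen + [i]], gamma)
            if val < best_val:
                best_i, best_val = i, val
        chosen.append(best_i)
    return chosen

# Hypothetical data: two well-separated clusters of 20 points each.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(5.0, 0.3, (20, 2))])
idx = greedy_mmd_coreset(x, 2, gamma=0.5)
print("coreset indices:", idx)
```

Running this selection separately on each dataset generally yields different representative points per dataset, which illustrates why independently computed coresets are hard to compare across datasets.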