A Decade's Battle on Dataset Bias: Are We There Yet?
Dataset bias remains an open problem in machine learning and could affect model generalization and fairness across applications.
The study revisits the 2011 dataset classification experiment using modern datasets and neural networks. It finds that models reach 84.7% accuracy on held-out validation data when classifying which of three datasets (YFCC, CC, DataComp) an image comes from, indicating persistent dataset bias.
We revisit the "dataset classification" experiment suggested by Torralba & Efros (2011) a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from: e.g., we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be explained by memorization. We hope our discovery will inspire the community to rethink issues involving dataset bias.
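The dataset classification setup can be illustrated with a small sketch: every sample is labeled by the dataset it came from, a classifier is fit on a training split, and accuracy is measured on a held-out split. This is not the authors' pipeline. Synthetic Gaussian vectors stand in for images, a nearest-centroid rule stands in for the neural network, and the helper names, the `shift` offset that simulates a dataset-specific bias, and the equal per-dataset sample counts are all assumptions made for illustration.

```python
import random

# The three sources named in the abstract; the label is the index in this list.
DATASETS = ["YFCC", "CC", "DataComp"]


def make_synthetic_split(n_per_dataset, dim, shift, rng):
    """Build (features, labels) with one hypothetical offset per dataset."""
    features, labels = [], []
    for label in range(len(DATASETS)):
        for _ in range(n_per_dataset):
            # Placeholder for an image representation: Gaussian noise plus a
            # dataset-specific offset on one coordinate (the simulated bias).
            vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            vec[label] += shift
            features.append(vec)
            labels.append(label)
    return features, labels


def fit_centroids(features, labels, num_classes):
    """Compute the mean feature vector of each dataset label."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for vec, y in zip(features, labels):
        counts[y] += 1
        for i, value in enumerate(vec):
            sums[y][i] += value
    return [[s / counts[c] for s in sums[c]] for c in range(num_classes)]


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], vec))


def accuracy(centroids, features, labels):
    """Fraction of samples whose predicted dataset matches the true one."""
    correct = sum(predict(centroids, v) == y for v, y in zip(features, labels))
    return correct / len(labels)


rng = random.Random(0)
train_x, train_y = make_synthetic_split(200, 8, 3.0, rng)
val_x, val_y = make_synthetic_split(100, 8, 3.0, rng)  # held-out split

centroids = fit_centroids(train_x, train_y, len(DATASETS))
val_acc = accuracy(centroids, val_x, val_y)

# With equal sample counts per dataset (an assumption of this sketch),
# uniform guessing on a three-way task is correct one time in three.
chance = 1.0 / len(DATASETS)
print(f"held-out accuracy: {val_acc:.3f} (chance: {chance:.3f})")
```

In this toy setting, held-out accuracy above chance only shows that the simulated offset is detectable. The 84.7% figure reported in the abstract comes from real images and neural networks, and nothing here reproduces it.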