CV LG Dec 2, 2024

Understanding Bias in Large-Scale Visual Datasets

Princeton

arXiv:2412.01876v1 17.320 citations h-index: 5 Has Code NIPS

Originality: Synthesis-oriented

AI Analysis

This work helps researchers understand bias in pre-training datasets so that they can build more diverse ones. It is incremental, however: it builds on prior findings without introducing a new paradigm.

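The study addresses the unclear forms of bias in large-scale visual datasets by proposing a framework that identifies the visual attributes unique to each dataset. It uses image transformations to assess how much each type of information reflects bias, object-level analysis to decompose semantic bias, and natural language methods to generate detailed descriptions of each dataset.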

A recent study has shown that large-scale visual datasets are very biased: they can be easily classified by modern neural networks. However, the concrete forms of bias among these datasets remain unclear. In this study, we propose a framework to identify the unique visual attributes distinguishing these datasets. Our approach applies various transformations to extract semantic, structural, boundary, color, and frequency information from datasets, and assesses how much each type of information reflects their bias. We further decompose their semantic bias with object-level analysis, and leverage natural language methods to generate detailed, open-ended descriptions of each dataset's characteristics. Our work aims to help researchers understand the bias in existing large-scale pre-training datasets, and build more diverse and representative ones in the future. Our project page and code are available at http://boyazeng.github.io/understand_bias.
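
The transform-then-classify idea in the abstract can be illustrated with a toy sketch. This is not the authors' code: the two "datasets" are synthetic, a nearest-centroid rule stands in for the neural network, only two of the information types (brightness and color) are covered, and every name here (`make_image`, `to_grayscale`, `to_mean_color`, `bias_by_transformation`) is hypothetical. The assumption is that if a classifier can still tell the datasets apart after a transformation, the information that transformation preserves carries some of the bias.

```python
import random


def make_image(rng, tint, n_pixels=16):
    """Synthetic image: a list of (r, g, b) pixels with a dataset-specific tint."""
    pixels = []
    for _ in range(n_pixels):
        r, g, b = (rng.uniform(0.3, 0.7) for _ in range(3))
        # The tint raises red and lowers blue by the same amount, so the
        # per-pixel channel average (brightness) is left unchanged.
        pixels.append((r + tint, g, b - tint))
    return pixels


def to_grayscale(image):
    """Keep brightness only: one value per pixel, color removed."""
    return [(r + g + b) / 3 for r, g, b in image]


def to_mean_color(image):
    """Keep color only: the average RGB value, spatial layout removed."""
    n = len(image)
    return [sum(p[c] for p in image) / n for c in range(3)]


def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def dataset_classification_accuracy(train, test, transform):
    """Nearest-centroid accuracy at predicting which dataset an image came from."""
    centroids = {}
    for label in (0, 1):
        feats = [transform(img) for img, y in train if y == label]
        centroids[label] = centroid(feats)
    correct = 0
    for img, y in test:
        feat = transform(img)
        pred = min(centroids, key=lambda lab: sq_dist(feat, centroids[lab]))
        correct += pred == y
    return correct / len(test)


def bias_by_transformation(seed=0, n_per_dataset=100):
    """Return held-out dataset-classification accuracy for each transformation."""
    rng = random.Random(seed)
    tints = {0: 0.2, 1: -0.2}  # dataset 0 is warm-tinted, dataset 1 is cool-tinted

    def sample():
        return [(make_image(rng, tints[y]), y)
                for y in tints for _ in range(n_per_dataset)]

    train, test = sample(), sample()
    transforms = {"grayscale": to_grayscale, "mean_color": to_mean_color}
    return {name: dataset_classification_accuracy(train, test, fn)
            for name, fn in transforms.items()}


for name, acc in bias_by_transformation().items():
    print(f"{name}: {acc:.2f}")
```

In this toy setup the two sets differ only in tint, so the color-only view should separate them almost perfectly while the grayscale view should stay near chance. The framework described in the abstract applies the same kind of comparison to real datasets across semantic, structural, boundary, color, and frequency information.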
