CV | LG | Nov 7, 2023

Exploring Dataset-Scale Indicators of Data Quality

arXiv:2311.04016v1 | 3.92 citations | h-index: 11

Originality: Incremental advance

AI Analysis

This work addresses data quality issues for researchers and practitioners in computer vision, offering incremental improvements through dataset-level analysis.

The paper tackles the problem of defining and improving data quality for computer vision foundation models. It decomposes data quality into sample-level and dataset-level constituents, focuses on two dataset-level constituents (label set design and class balance), and provides indicators for anticipating model performance in terms of accuracy and robustness to distribution shifts.

Modern computer vision foundation models are trained on massive amounts of data, incurring large economic and environmental costs. Recent research has suggested that improving data quality can significantly reduce the need for data quantity. But what constitutes data quality in computer vision? We posit that the quality of a given dataset can be decomposed into distinct sample-level and dataset-level constituents, and that the former have been more extensively studied than the latter. We ablate the effects of two important dataset-level constituents: label set design, and class balance. By monitoring these constituents using key indicators we provide, researchers and practitioners can better anticipate model performance, measured in terms of its accuracy and robustness to distribution shifts.
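
To make the idea of a dataset-level indicator concrete, here is a minimal Python sketch that summarizes class balance for a list of labels. It is an illustration only: the abstract does not specify the paper's indicators, so the two statistics used here (normalized Shannon entropy and max/min class-count ratio) and the function name `class_balance_indicators` are assumptions, chosen because they are common generic measures of balance.

```python
import math
from collections import Counter


def class_balance_indicators(labels):
    """Summarize class balance for a sequence of labels.

    Hypothetical helper for illustration; these statistics are generic
    balance measures, not necessarily the indicators used in the paper.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must be non-empty")

    total = sum(counts.values())
    num_classes = len(counts)

    # Shannon entropy of the empirical label distribution (in nats).
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)

    # Normalize by log(K) so a perfectly balanced dataset scores 1.0.
    # A single-class dataset is treated as trivially balanced (assumption).
    normalized_entropy = entropy / math.log(num_classes) if num_classes > 1 else 1.0

    # Ratio of the largest class to the smallest observed class.
    imbalance_ratio = max(counts.values()) / min(counts.values())

    return {
        "num_classes": num_classes,
        "normalized_entropy": normalized_entropy,
        "imbalance_ratio": imbalance_ratio,
    }


if __name__ == "__main__":
    balanced = ["cat", "dog"] * 5
    skewed = ["cat"] * 9 + ["dog"]
    print(class_balance_indicators(balanced))
    print(class_balance_indicators(skewed))
```

A normalized entropy near 1 and an imbalance ratio near 1 suggest a balanced label distribution; lower entropy or a larger ratio suggest imbalance. Classes with zero samples are not visible to this helper, since it only counts labels that occur.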
