A Method for Curation of Web-Scraped Face Image Datasets
This work addresses the need for clean, gender-balanced datasets to support fair comparisons in face recognition research, though the contribution is incremental, since it builds on existing automated methods.
The paper tackles the problem of cleaning web-scraped face image datasets, which often contain errors such as mislabeled identities and duplicates. It proposes a semi-automated curation method, and its experiments show that a state-of-the-art face recognition method achieves much higher accuracy on the curated datasets.
Web-scraped, in-the-wild datasets have become the norm in face recognition research. The numbers of subjects and images in web-scraped datasets are usually very large, with the number of images on the scale of millions. A variety of issues arise when collecting a dataset in the wild, including images with the wrong identity label, duplicate images, duplicate subjects, and variation in quality. With images numbering in the millions, a fully manual cleaning procedure is not feasible, yet the fully automated methods used to date leave datasets less clean than desired. We propose a semi-automated method whose goal is a clean dataset for testing face recognition methods, with similar quality across men and women, to support comparison of accuracy across gender. Our approach removes near-duplicate images, merges duplicate subjects, corrects mislabeled images, and removes images outside a defined range of pose and quality. We apply the curation to the Asian Face Dataset (AFD) and the VGGFace2 test dataset. The experiments show that a state-of-the-art method achieves much higher accuracy on the datasets after they are curated. Finally, we release our cleaned versions of both datasets to the research community.
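To make the first curation step concrete, the following is a minimal sketch of near-duplicate removal within one subject's folder. It is an illustration under stated assumptions, not the procedure used in the paper: it assumes each image has already been mapped to a feature vector by some face matcher, and the function names, the cosine-similarity criterion, and the 0.95 threshold are all hypothetical choices made for this example.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_near_duplicates(
    embeddings: Dict[str, Sequence[float]],
    threshold: float = 0.95,  # hypothetical value, not taken from the paper
) -> List[str]:
    """Greedily keep images whose similarity to every kept image is below threshold.

    `embeddings` maps an image identifier to its feature vector (assumed to be
    precomputed). Images are visited in sorted order so the result is
    deterministic; the first image of each near-duplicate group is kept.
    """
    kept: List[str] = []
    for image_id in sorted(embeddings):
        vector = embeddings[image_id]
        if all(cosine_similarity(vector, embeddings[k]) < threshold for k in kept):
            kept.append(image_id)
    return kept
```

For example, with the toy vectors `{"a.jpg": [1.0, 0.0], "b.jpg": [0.999, 0.01], "c.jpg": [0.0, 1.0]}`, the second image is nearly identical to the first and is dropped, while the third is kept. In a semi-automated setting, pairs that fall close to the threshold could be sent to a human reviewer rather than discarded automatically.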