CV · LG · Apr 7, 2020

A Method for Curation of Web-Scraped Face Image Datasets

arXiv:2004.03074v1 · 3 citations
AI Analysis

The paper addresses the need for clean, gender-balanced datasets to support fair comparisons in face recognition research, though the contribution is incremental because it builds on existing automated methods.

The paper tackles the problem of cleaning web-scraped face image datasets, which often contain errors such as mislabeled identities and duplicates. It proposes a semi-automated curation method, and its experiments show that a state-of-the-art face recognition method achieves much higher accuracy on the curated datasets.

Web-scraped, in-the-wild datasets have become the norm in face recognition research. The numbers of subjects and images acquired in web-scraped datasets are usually very large, with the number of images on the scale of millions. A variety of issues occur when collecting a dataset in the wild, including images with the wrong identity label, duplicate images, duplicate subjects, and variation in quality. With the number of images in the millions, a manual cleaning procedure is not feasible, but the fully automated methods used to date produce datasets that are not sufficiently clean. We propose a semi-automated method, where the goal is to have a clean dataset for testing face recognition methods, with similar quality across men and women, to support comparison of accuracy across gender. Our approach removes near-duplicate images, merges duplicate subjects, corrects mislabeled images, and removes images outside a defined range of pose and quality. We conduct the curation on the Asian Face Dataset (AFD) and the VGGFace2 test dataset. The experiments show that a state-of-the-art method achieves a much higher accuracy on the datasets after they are curated. Finally, we release our cleaned versions of both datasets to the research community.
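The abstract does not say how near-duplicate images are detected, so the following is only an illustrative sketch of one common approach, not the authors' procedure. It assumes each image has already been mapped to a feature vector by some face matcher, and it uses a hypothetical cosine-similarity threshold of 0.95; the function names and toy vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def drop_near_duplicates(embeddings, threshold=0.95):
    """Greedy filter: keep an image only if its similarity to every
    already-kept image is below the (assumed) threshold.

    Returns the indices of the images that are kept, in input order.
    """
    kept = []
    for idx, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


# Toy feature vectors (hypothetical): the second is almost identical to the first.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.999, 0.01, 0.0],
    [0.0, 1.0, 0.0],
]
kept_indices = drop_near_duplicates(toy_embeddings)
print(kept_indices)
```

The greedy pass compares every image against all kept images, which is quadratic in the worst case; at the million-image scale mentioned in the abstract, a real pipeline would likely run such a check per subject or use an approximate nearest-neighbour index, and would route borderline cases to a human reviewer in keeping with the semi-automated framing.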

Code Implementations: 2 repos
