CV | LG | Sep 12, 2024

SIG: A Synthetic Identity Generation Pipeline for Generating Evaluation Datasets for Face Recognition

Kassi Nzalasse, Rishav Raj, Eli Laird, Corey Clark

arXiv:2409.08345v2 | 2 citations | h-index: 2 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the resource-intensive and ethically fraught data collection that confronts researchers and developers in face recognition, though the advance is incremental because it builds on existing synthetic data generation methods.

The paper tackles the challenge of creating ethical and representative evaluation datasets for face recognition by introducing the Synthetic Identity Generation (SIG) pipeline, which generates high-quality synthetic face images with controllable attributes. It also releases an open-source dataset, ControlFace10k, with 10,008 images of 3,336 identities for assessing algorithmic bias.

As Artificial Intelligence applications expand, the evaluation of models faces heightened scrutiny. Ensuring public readiness requires evaluation datasets, which differ from training data by being disjoint and ethically sourced in compliance with privacy regulations. The performance and fairness of face recognition systems depend significantly on the quality and representativeness of these evaluation datasets. Such data is sometimes scraped from the internet without users' consent, raising ethical concerns that can prohibit its use without proper releases. In rare cases, data is collected in a controlled environment with consent; however, this process is time-consuming, expensive, and logistically difficult to execute. This creates a barrier for those unable to muster the immense resources required to gather ethically sourced evaluation datasets. To address these challenges, we introduce the Synthetic Identity Generation pipeline, or SIG, which allows for the targeted creation of ethical, balanced datasets for face recognition evaluation. Our proposed and demonstrated pipeline generates high-quality images of synthetic identities with controllable pose, facial features, and demographic attributes, such as race, gender, and age. We also release an open-source evaluation dataset named ControlFace10k, consisting of 10,008 face images of 3,336 unique synthetic identities balanced across race, gender, and age, generated using the proposed SIG pipeline. We analyze ControlFace10k along with a non-synthetic BUPT dataset using state-of-the-art face recognition algorithms to demonstrate its effectiveness as an evaluation tool. This analysis highlights the dataset's characteristics and its utility in assessing algorithmic bias across different demographic groups.
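To make "assessing algorithmic bias across different demographic groups" concrete, here is a minimal sketch of one common way such a comparison could be set up: computing the false match rate (FMR) and false non-match rate (FNMR) separately for each group at a fixed similarity threshold. This is not the authors' evaluation code. The function name `per_group_error_rates`, the group labels, the toy similarity scores, and the threshold of 0.6 are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def per_group_error_rates(pairs, threshold):
    """Compute FMR and FNMR per demographic group.

    pairs: iterable of (group, similarity, is_same_identity) tuples, where
    similarity is a face-matcher score and higher means "more alike".
    A pair is accepted as a match when similarity >= threshold.
    """
    counts = defaultdict(
        lambda: {"false_match": 0, "impostor": 0, "false_non_match": 0, "genuine": 0}
    )
    for group, similarity, is_same_identity in pairs:
        c = counts[group]
        accepted = similarity >= threshold
        if is_same_identity:
            c["genuine"] += 1
            if not accepted:
                c["false_non_match"] += 1  # genuine pair wrongly rejected
        else:
            c["impostor"] += 1
            if accepted:
                c["false_match"] += 1  # impostor pair wrongly accepted

    rates = {}
    for group, c in counts.items():
        # None signals that a group had no pairs of that kind.
        fmr = c["false_match"] / c["impostor"] if c["impostor"] else None
        fnmr = c["false_non_match"] / c["genuine"] if c["genuine"] else None
        rates[group] = {"fmr": fmr, "fnmr": fnmr}
    return rates


# Hypothetical toy scores, not taken from ControlFace10k or BUPT.
toy_pairs = [
    ("group_a", 0.91, True),
    ("group_a", 0.42, True),
    ("group_a", 0.15, False),
    ("group_a", 0.72, False),
    ("group_b", 0.88, True),
    ("group_b", 0.79, True),
    ("group_b", 0.10, False),
    ("group_b", 0.30, False),
]

rates = per_group_error_rates(toy_pairs, threshold=0.6)
for group, r in sorted(rates.items()):
    print(group, r)
```

A large gap in FMR or FNMR between groups at the same threshold is one signal of demographic bias in a matcher. A dataset balanced across race, gender, and age, as ControlFace10k is described to be, keeps the per-group pair counts comparable, so such gaps are less likely to be an artifact of uneven sample sizes.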
