Balanced Face Dataset: Guiding StyleGAN to Generate Labeled Synthetic Face Image Dataset for Underrepresented Group
This addresses bias in machine learning models for face-related applications by providing a cost-effective synthetic dataset, though it is incremental as it builds on existing StyleGAN methods.
The study tackled the problem of underrepresented groups in face datasets by generating a balanced synthetic face image dataset using StyleGAN, controlling the generation process to ensure demographic diversity and annotating it for downstream tasks.
For a machine learning model to generalize effectively to unseen data within a particular problem domain, it is well-understood that the data needs to be of sufficient size and representative of real-world scenarios. Nonetheless, real-world datasets frequently have overrepresented and underrepresented groups. One solution to mitigate bias in machine learning is to leverage a diverse and representative dataset. Training a model on a dataset that covers all demographics is crucial to reducing bias in machine learning. However, collecting and labeling large-scale datasets has been challenging, prompting the use of synthetic data generation and active labeling to decrease the costs of manual labeling. The focus of this study was to generate a robust face image dataset using the StyleGAN model. In order to achieve a balanced distribution of the dataset among different demographic groups, a synthetic dataset was created by controlling the generation process of StyleGaN and annotated for different downstream tasks.