Analyzing the Feature Extractor Networks for Face Image Synthesis
This work addresses the need for better evaluation criteria in face image synthesis, a concern for researchers in computer vision and generative AI. It is incremental, however: it builds on existing methods rather than introducing a new paradigm.
This study examines how to evaluate the realism of synthesized face images by analyzing the behavior of four feature extractors (InceptionV3, CLIP, DINOv2, and ArcFace) under the FID, KID, and Precision&Recall metrics, with FFHQ as the target domain and CelebA-HQ and synthetic datasets as source domains. The analysis offers insights into how each extractor behaves when used to assess face image synthesis.
Advances such as Generative Adversarial Networks have drawn researchers toward face image synthesis, with the goal of generating ever more realistic images. Consequently, the need for evaluation criteria that assess the realism of generated images has become apparent. While FID computed with InceptionV3 features is one of the primary choices for benchmarking, concerns have emerged about InceptionV3's limitations on face images. This study investigates the behavior of diverse feature extractors (InceptionV3, CLIP, DINOv2, and ArcFace) under a variety of metrics (FID, KID, and Precision\&Recall). The FFHQ dataset serves as the target domain; the source domains are the CelebA-HQ dataset and synthetic datasets generated with StyleGAN2 and Projected FastGAN. The experiments include an in-depth analysis of the features: $L_2$ normalization, model attention during extraction, and domain distributions in the feature space. We aim to provide insights into the behavior of feature extractors for evaluating face image synthesis methods. The code is publicly available at https://github.com/ThEnded32/AnalyzingFeatureExtractors.
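As a rough illustration of two ingredients named above, $L_2$ normalization of features and the Fréchet distance that underlies FID, the following NumPy sketch may help. It is not the authors' implementation (the linked repository is the reference for that). The function names `l2_normalize` and `frechet_distance` are hypothetical, and the random arrays merely stand in for features that would in practice come from an extractor such as InceptionV3, CLIP, DINOv2, or ArcFace.

```python
import numpy as np


def l2_normalize(features: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale every feature vector (one per row) to unit Euclidean length."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (up to floating-point error).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    fd = diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt
    # Clamp tiny negative values caused by round-off.
    return float(max(fd, 0.0))


if __name__ == "__main__":
    # Assumption: random vectors stand in for extracted image features.
    rng = np.random.default_rng(0)
    target = rng.normal(size=(500, 8))
    source = rng.normal(loc=0.3, size=(500, 8))

    print("raw features:          ", frechet_distance(target, source))
    print("L2-normalized features:",
          frechet_distance(l2_normalize(target), l2_normalize(source)))
```

Computing the distance on raw and on normalized features side by side shows how strongly the score can depend on feature scale, which is one reason normalization is worth examining as a separate factor.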