Evaluating the Quality and Diversity of DCGAN-based Generatively Synthesized Diabetic Retinopathy Imagery
This work addresses dataset imbalance for diabetic retinopathy diagnosis, but it is incremental as it applies existing evaluation metrics to a specific biomedical domain.
The paper tackled the problem of imbalanced diabetic retinopathy datasets by evaluating metrics for selecting high-quality and diverse synthetic images generated by a DCGAN, finding that FID assesses quality while MS-SSIM and CD assess diversity, and showing that augmented datasets improved CNN and EfficientNet classifier performance with F1 and AUC scores.
Publicly available diabetic retinopathy (DR) datasets are imbalanced, containing limited numbers of images with DR. This imbalance contributes to overfitting when training machine learning classifiers. The impact of this imbalance is exacerbated as the severity of the DR stage increases, affecting the classifiers' diagnostic capacity. The imbalance can be addressed using Generative Adversarial Networks (GANs) to augment the datasets with synthetic images. Generating synthetic images is advantageous if high-quality and diversified images are produced. To evaluate the quality and diversity of synthetic images, several evaluation metrics, such as Multi-Scale Structural Similarity Index (MS-SSIM), Cosine Distance (CD), and Fréchet Inception Distance (FID) are used. Understanding the effectiveness of each metric in evaluating the quality and diversity of GAN-based synthetic images is critical to select images for augmentation. To date, there has been limited analysis of the appropriateness of these metrics in the context of biomedical imagery. This work contributes an empirical assessment of these evaluation metrics as applied to synthetic Proliferative DR imagery generated by a Deep Convolutional GAN (DCGAN). Furthermore, the metrics' capacity to indicate the quality and diversity of synthetic images and a correlation with classifier performance is undertaken. This enables a quantitative selection of synthetic imagery and an informed augmentation strategy. Results indicate that FID is suitable for evaluating the quality, while MS-SSIM and CD are suitable for evaluating the diversity of synthetic imagery. Furthermore, the superior performance of Convolutional Neural Network (CNN) and EfficientNet classifiers, as indicated by the F1 and AUC scores, for the augmented datasets demonstrates the efficacy of synthetic imagery to augment the imbalanced dataset.