How good is my GAN?
This work addresses the need for better evaluation metrics in generative modeling and is aimed at researchers and practitioners who use GANs. Its contribution is incremental, since it builds on existing quantitative criteria.
The paper argues that existing measures for quantitatively evaluating GANs are insufficient and introduces two new measures, GAN-train and GAN-test, to approximate recall and precision. It demonstrates clear performance differences among recent GAN approaches and shows that GAN quality falls as dataset difficulty increases.
Generative adversarial networks (GANs) are one of the most popular methods for generating images today. While impressive results have been validated by visual inspection, a number of quantitative criteria have emerged only recently. We argue here that the existing ones are insufficient and need to be adapted to the task at hand. In this paper we introduce two measures based on image classification, GAN-train and GAN-test, which approximate the recall (diversity) and precision (image quality) of GANs, respectively. We evaluate a number of recent GAN approaches based on these two measures and demonstrate a clear difference in performance. Furthermore, we observe that the increasing difficulty of the dataset, from CIFAR10 through CIFAR100 to ImageNet, shows an inverse correlation with the quality of the GANs, as clearly evident from our measures.
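The abstract does not spell out how the two measures are computed. The sketch below assumes that GAN-train is the accuracy of a classifier trained on generated images and evaluated on real images, and that GAN-test is the accuracy of a classifier trained on real images and evaluated on generated images. The nearest-centroid classifier, the 2-D toy data, and all function names are hypothetical stand-ins chosen to keep the example self-contained; they are not the networks or datasets used in the paper.

```python
import numpy as np

def fit_centroids(x, y):
    """Toy classifier (an assumption): one mean vector per class label."""
    classes = np.unique(y)
    centroids = np.stack([x[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def accuracy(model, x, y):
    """Fraction of samples whose nearest centroid carries the true label."""
    classes, centroids = model
    # Squared Euclidean distance from every sample to every centroid.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == y).mean())

def gan_train_score(gen_x, gen_y, real_val_x, real_val_y):
    """Train on generated samples, evaluate on real ones (recall-like)."""
    return accuracy(fit_centroids(gen_x, gen_y), real_val_x, real_val_y)

def gan_test_score(real_train_x, real_train_y, gen_x, gen_y):
    """Train on real samples, evaluate on generated ones (precision-like)."""
    return accuracy(fit_centroids(real_train_x, real_train_y), gen_x, gen_y)

# Toy data: two well-separated classes in 2-D stand in for image features.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [10.0, 10.0]])

def sample(n, spread):
    """Draw n labelled points; `spread` is the per-class standard deviation."""
    labels = rng.integers(0, 2, size=n)
    points = rng.normal(loc=means[labels], scale=spread)
    return points, labels

real_train_x, real_train_y = sample(500, 1.0)
real_val_x, real_val_y = sample(500, 1.0)
# Stand-in for a class-conditional generator that matches the real data well.
gen_x, gen_y = sample(500, 1.0)

gan_train = gan_train_score(gen_x, gen_y, real_val_x, real_val_y)
gan_test = gan_test_score(real_train_x, real_train_y, gen_x, gen_y)
print(f"GAN-train: {gan_train:.3f}  GAN-test: {gan_test:.3f}")
```

Under these assumptions, both scores should approach the accuracy of a classifier trained and validated on real data when the generator matches the real distribution. A low GAN-train score would suggest the generated set misses part of the real data (poor diversity), while a low GAN-test score would suggest the generated samples are not recognizable as their intended class (poor quality).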