CV · LG · ML · Feb 23, 2020

Reliable Fidelity and Diversity Metrics for Generative Models

arXiv:2002.09797v2 · 569 citations · Has Code
AI Analysis

This work addresses a critical issue for researchers and practitioners in generative modeling by improving evaluation metrics, though it is incremental as it builds on existing precision and recall variants.

The paper tackles the problem of unreliable evaluation metrics for generative models. It shows that existing precision and recall metrics fail in key areas, such as detecting identical distributions and staying robust to outliers, and it proposes density and coverage metrics that provide more interpretable and reliable signals.

Devising indicative evaluation metrics for the image generation task remains an open problem. The most widely used metric for measuring the similarity between real and generated images has been the Fréchet Inception Distance (FID) score. Because it does not differentiate the fidelity and diversity aspects of the generated images, recent papers have introduced variants of precision and recall metrics to diagnose those properties separately. In this paper, we show that even the latest version of the precision and recall metrics are not reliable yet. For example, they fail to detect the match between two identical distributions, they are not robust against outliers, and the evaluation hyperparameters are selected arbitrarily. We propose density and coverage metrics that solve the above issues. We analytically and experimentally show that density and coverage provide more interpretable and reliable signals for practitioners than the existing metrics. Code: https://github.com/clovaai/generative-evaluation-prdc.
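The abstract names density and coverage but does not give their formulas. The sketch below is one way to compute them, assuming the k-nearest-neighbour formulation: each real sample gets a ball whose radius is the distance to its k-th nearest real neighbour; density counts how many such balls contain each generated sample (normalised by k), and coverage is the fraction of real samples whose ball contains at least one generated sample. The function names, the default `k=5`, and the use of raw NumPy arrays as feature embeddings are illustrative assumptions, not the reference implementation from the linked repository.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def density_coverage(real, fake, k=5):
    """Sketch of density and coverage (assumed k-NN ball formulation).

    real: (N, d) array of real-sample features
    fake: (M, d) array of generated-sample features
    k:    neighbourhood size (hypothetical default)
    """
    # Radius of each real sample's ball: distance to its k-th nearest
    # real neighbour. Index 0 of the sorted row is the sample itself.
    real_real = pairwise_distances(real, real)
    radii = np.sort(real_real, axis=1)[:, k]

    # inside[i, j] is True when fake sample j lies in real sample i's ball.
    real_fake = pairwise_distances(real, fake)
    inside = real_fake < radii[:, None]

    # Density: average number of real balls containing a fake sample, over k.
    density = inside.sum() / (k * fake.shape[0])
    # Coverage: fraction of real balls that contain at least one fake sample.
    coverage = (real_fake.min(axis=1) < radii).mean()
    return float(density), float(coverage)


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))

# Generated samples identical to the real ones versus samples far away.
same_density, same_coverage = density_coverage(real, real.copy(), k=5)
far_density, far_coverage = density_coverage(real, real + 1000.0, k=5)
```

In this toy setup, a generated set identical to the real set scores 1 on both metrics, and a set shifted far from the real data scores 0 on both. Density is not capped at 1: generated samples concentrated in dense real regions can push it higher.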

Code Implementations (3 repos)