Revisiting Precision and Recall Definition for Generative Model Evaluation
This work addresses evaluation challenges for generative models, offering a refined tool to distinguish mode collapse from sample-quality issues, though it is incremental in that it builds on existing definitions.
The paper revisits Precision-Recall curves for generative models, generalizing the formulation to arbitrary measures and linking it to error rates of likelihood ratio classifiers, and demonstrates improved performance on controlled multi-modal datasets.
In this article we revisit the definition of Precision-Recall (PR) curves for generative models proposed by Sajjadi et al. (arXiv:1806.00035). Rather than summarizing generative quality in a single scalar, PR curves distinguish mode collapse (poor recall) from bad sample quality (poor precision). We first generalize their formulation to arbitrary measures, thereby removing any restriction to finite support. We also expose a bridge between PR curves and the type I and type II error rates of likelihood ratio classifiers on the task of discriminating between samples of the two distributions. Building upon this new perspective, we propose a novel algorithm to approximate precision-recall curves that shares some interesting methodological properties with the hypothesis testing technique of Lopez-Paz et al. (arXiv:1610.06545). We demonstrate the benefits of the proposed formulation over the original approach on controlled multi-modal datasets.
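To make the bridge between PR curves and classifier error rates concrete, here is a minimal sketch restricted to the finite-support setting of Sajjadi et al., not the general-measure formulation or the approximation algorithm proposed in the article. It assumes `p` is the reference distribution and `q` the generated one, both given as probability vectors over the same support; the function names and the toy distributions are hypothetical. The first function uses the closed form alpha(lambda) = sum_i min(lambda * p_i, q_i) with beta = alpha / lambda; the second recovers the same point from the two error masses of a likelihood ratio classifier that thresholds q / p at lambda.

```python
import numpy as np

def prd_point(p, q, lam):
    """Closed-form PR point on a finite support (Sajjadi et al. definition)."""
    alpha = np.minimum(lam * p, q).sum()  # precision at slope lam
    beta = alpha / lam                    # recall at slope lam
    return alpha, beta

def prd_point_from_classifier(p, q, lam):
    """Same PR point, written via the errors of a likelihood ratio classifier."""
    # Classifier: declare "generated" wherever q >= lam * p.
    says_q = q >= lam * p
    err_p = p[says_q].sum()    # reference mass wrongly declared "generated"
    err_q = q[~says_q].sum()   # generated mass wrongly declared "reference"
    alpha = lam * err_p + err_q
    return alpha, alpha / lam

# Toy mode-collapse example (illustrative values, not from the article):
# the reference has two equally likely modes, the model covers only one.
p = np.array([0.5, 0.5, 0.0])
q = np.array([1.0, 0.0, 0.0])

# Slopes swept over angles in (0, pi/2), so lam ranges from ~0 to large values.
lams = np.tan(np.linspace(0.01, np.pi / 2 - 0.01, 50))
curve = np.array([prd_point(p, q, lam) for lam in lams])
print("max precision:", curve[:, 0].max())
print("max recall:", curve[:, 1].max())
```

In this toy case the model never produces anything outside the reference support, so precision can reach 1, but it misses half of the reference mass, so recall is capped at 0.5. Both functions agree because min(lambda * p_i, q_i) picks lambda * p_i exactly on the region where the classifier declares "generated" and q_i elsewhere.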