A Theoretical Framework for Statistical Evaluability of Generative Models

Shashaank Aiyer, Yishay Mansour, Shay Moran, Han Shao

arXiv:2604.05324 · 8.61 citations · h-index: 5

Predicted impact: top 53% in LG · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses a foundational problem in machine learning: how reliably evaluation metrics for generative models can be estimated from data. Its theoretical insights are relevant to both researchers and practitioners, though the contribution is incremental because it builds on existing statistical theory.

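The paper tackles the challenge of evaluating generative models by introducing a theoretical framework that establishes evaluability results for common metrics. It shows that integral probability metrics over a bounded test class can be evaluated from finite samples up to multiplicative and additive approximation errors, and with arbitrary precision when the test class has finite fat-shattering dimension. Rényi and KL divergences, by contrast, are not evaluable from finite samples because their values can be determined by rare events.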

Statistical evaluation aims to estimate the generalization performance of a model using held-out i.i.d. test data sampled from the ground-truth distribution. In supervised learning settings such as classification, performance metrics such as error rate are well-defined, and test error reliably approximates population error given sufficiently large datasets. In contrast, evaluation is more challenging for generative models due to their open-ended nature: it is unclear which metrics are appropriate and whether such metrics can be reliably evaluated from finite samples. In this work, we introduce a theoretical framework for evaluating generative models and establish evaluability results for commonly used metrics. We study two categories of metrics: test-based metrics, including integral probability metrics (IPMs), and Rényi divergences. We show that IPMs with respect to any bounded test class can be evaluated from finite samples up to multiplicative and additive approximation errors. Moreover, when the test class has finite fat-shattering dimension, IPMs can be evaluated with arbitrary precision. In contrast, Rényi and KL divergences are not evaluable from finite samples, as their values can be critically determined by rare events. We also analyze the potential and limitations of perplexity as an evaluation method.
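
The contrast between IPMs and KL divergence can be illustrated with a small numerical sketch. This is not the paper's construction; it is a toy example under stated assumptions. The test class (nine threshold indicators), the two sampling distributions, the rare-event mass `delta`, and the function names `empirical_ipm` and `kl_divergence` are all hypothetical choices made for illustration, and the sketch assumes sample access to both the ground truth and the model.

```python
import math
import random


def empirical_ipm(xs, ys, tests):
    """Plug-in estimate of sup_f |E_P f - E_Q f| over a finite list of tests."""
    def mean(f, sample):
        return sum(f(v) for v in sample) / len(sample)
    return max(abs(mean(f, xs) - mean(f, ys)) for f in tests)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf  # p puts mass where q puts none
        total += pi * math.log(pi / qi)
    return total


# Assumed test class: threshold indicators, each bounded in [0, 1].
thresholds = [i / 10 for i in range(1, 10)]
tests = [lambda v, t=t: 1.0 if v <= t else 0.0 for t in thresholds]

rng = random.Random(0)
xs = [rng.random() for _ in range(5000)]       # stand-in ground truth: Uniform[0, 1]
ys = [rng.random() ** 2 for _ in range(5000)]  # stand-in model: skewed toward 0

# Population value over this class is max_t (sqrt(t) - t), about 0.248 at t = 0.3.
ipm_hat = empirical_ipm(xs, ys, tests)

# Rare-event sensitivity of KL: two models that differ only on a rare outcome.
delta = 1e-6
p = [1 - delta, delta]       # ground truth with a rare outcome of mass delta
q_match = [1 - delta, delta]  # model that matches p exactly
q_miss = [1.0, 0.0]           # model that never produces the rare outcome

kl_match = kl_divergence(p, q_match)  # zero
kl_miss = kl_divergence(p, q_miss)    # infinite

# Chance that n i.i.d. test samples from p contain the rare outcome at all.
n = 1000
prob_rare_seen = 1 - (1 - delta) ** n

print(f"empirical IPM: {ipm_hat:.3f}")
print(f"KL(p || q_match) = {kl_match}, KL(p || q_miss) = {kl_miss}")
print(f"P(rare outcome appears in {n} samples) = {prob_rare_seen:.6f}")
```

In this toy setting the empirical IPM concentrates around its population value because every test is bounded and the class is finite. The two models `q_match` and `q_miss`, however, have KL divergences of zero and infinity while a test set of 1000 samples almost never contains the outcome that separates them.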
