LG · Oct 18, 2023

On the Distributed Evaluation of Generative Models

Zixiao Wang, Farzan Farnia, Zhenghao Lin, Yunheng Shen, Bei Yu

arXiv:2310.11714v4 · 10.78 citations · h-index: 15

Originality: Incremental advance

AI Analysis

This work addresses the challenge of reliable model evaluation in federated learning and other distributed applications and offers insights for practitioners. The advance is incremental, however: it builds on existing metrics rather than proposing new ones.

The paper tackles the problem of evaluating generative models in distributed settings where clients hold heterogeneous data. It shows that averaging Kernel Inception Distance (KID) scores across clients yields the same model ranking as a centralized evaluation over the pooled reference data. Per-client Fréchet Inception Distance (FID) scores carry no such guarantee and can disagree with the centralized FID. Both claims are supported by theoretical proofs and by numerical experiments on standard image datasets.
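The FID discrepancy can be illustrated with a toy one-dimensional case. The sketch below is a hand-built example, not one taken from the paper: it assumes Gaussian clients described only by a mean and a standard deviation, and uses the closed-form Fréchet distance between 1-D Gaussians, (m1 - m2)^2 + (s1 - s2)^2. The helper `fid_1d`, the client parameters, and the two models "A" and "B" are all hypothetical.

```python
import math

def fid_1d(mean_p, std_p, mean_q, std_q):
    """Frechet distance between two 1-D Gaussians (closed form)."""
    return (mean_p - mean_q) ** 2 + (std_p - std_q) ** 2

# Two hypothetical clients with opposite means and the same spread.
a = math.sqrt(5.0)
clients = [(-a, 2.0), (a, 2.0)]  # (mean, std) per client

# Gaussian fit to the pooled data (equal-weight mixture of the clients):
# variance = average within-client variance + variance of client means.
pooled_mean = sum(m for m, _ in clients) / len(clients)
pooled_var = sum(s ** 2 + (m - pooled_mean) ** 2 for m, s in clients) / len(clients)
pooled = (pooled_mean, math.sqrt(pooled_var))

# Two hypothetical generative models, also summarized as (mean, std).
models = {"A": (0.0, 3.0), "B": (0.0, 1.0)}

per_client = {
    name: [fid_1d(m, s, *q) for m, s in clients] for name, q in models.items()
}
centralized = {name: fid_1d(*pooled, *q) for name, q in models.items()}

print("per-client FID:", per_client)
print("centralized FID:", centralized)
```

Under these assumptions every client scores both models identically (6, up to floating-point rounding), yet the centralized score is about 0 for model A and about 4 for model B, so client-level FID cannot separate two models that the pooled evaluation ranks very differently.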

The evaluation of deep generative models has been extensively studied in the centralized setting, where the reference data are drawn from a single probability distribution. On the other hand, several applications of generative models concern distributed settings, e.g., the federated learning setting, where the reference data for conducting evaluation are provided by several clients in a network. In this paper, we study the evaluation of generative models in such distributed contexts with potentially heterogeneous data distributions across clients. We focus on the widely used distance-based evaluation metrics, Fréchet Inception Distance (FID) and Kernel Inception Distance (KID). For the KID metric, we prove that scoring a group of generative models using the clients' averaged KID score results in the same ranking as a centralized KID evaluation over a collective reference set containing all the clients' data. In contrast, we show that the same result does not hold for FID-based evaluation. We provide examples in which two generative models are assigned the same FID score by each client in a distributed setting, while the centralized FID scores of the two models are significantly different. We perform several numerical experiments on standard image datasets and generative models to support our theoretical results on the distributed evaluation of generative models using FID and KID scores.
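The KID ranking result can be checked numerically on synthetic data. The sketch below is a minimal illustration under stated assumptions, not the paper's experimental setup: random Gaussian vectors stand in for Inception features, all clients hold the same number of samples, and the squared MMD is computed with the biased (V-statistic) estimator and the cubic polynomial kernel usually associated with KID. With these choices the client-averaged score and the pooled score should differ by a constant that does not depend on the generated samples. The function names are hypothetical.

```python
import numpy as np

def poly_kernel(x, y):
    """Cubic polynomial kernel k(x, y) = (x.y / d + 1)^3."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def mmd2_biased(x, y):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (
        poly_kernel(x, x).mean()
        - 2.0 * poly_kernel(x, y).mean()
        + poly_kernel(y, y).mean()
    )

def distributed_and_centralized(clients, generated):
    """Return (client-averaged score, score against the pooled reference set)."""
    averaged = float(np.mean([mmd2_biased(c, generated) for c in clients]))
    pooled = float(mmd2_biased(np.vstack(clients), generated))
    return averaged, pooled

rng = np.random.default_rng(0)
dim, n = 16, 200

# Heterogeneous clients: same sample size, different feature means (assumed).
clients = [rng.normal(loc=shift, size=(n, dim)) for shift in (-1.0, 0.0, 2.0)]

# Stand-ins for features of samples from two generative models (assumed).
model_a = rng.normal(loc=0.3, size=(300, dim))
model_b = rng.normal(loc=-0.5, scale=1.5, size=(300, dim))

avg_a, cen_a = distributed_and_centralized(clients, model_a)
avg_b, cen_b = distributed_and_centralized(clients, model_b)

print("gap for model A:", avg_a - cen_a)
print("gap for model B:", avg_b - cen_b)
print("same ranking:", (avg_a < avg_b) == (cen_a < cen_b))
```

The two printed gaps should agree up to floating-point error, because the gap measures only how far the clients' mean embeddings spread around their pooled mean. A model-independent offset cannot change the ordering of the models. With the unbiased estimator typically used to report KID, or with unequal client sizes, the match is not expected to be exact in finite samples.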
