CV | AI | Oct 26, 2023

Attribute Based Interpretable Evaluation Metrics for Generative Models

arXiv:2310.17261v3 | 4 citations | h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the need for explainable evaluation metrics in generative AI, providing insights into model failures that standard metrics miss, though it is incremental in building on existing attribute-based analysis.

The authors tackled the problem of evaluating generative models beyond diversity by proposing interpretable metrics that measure divergence in attribute distributions between generated and training data, revealing specific weaknesses in existing models such as implausible attribute relationships in ProjectedGAN and color diversity issues in diffusion models.

When the training dataset comprises a 1:1 proportion of dogs to cats, a generative model that produces dogs and cats at 1:1 resembles the training species distribution better than another model that produces them at 3:1. Can we capture this phenomenon using existing metrics? Unfortunately, we cannot, because these metrics do not provide any interpretability beyond "diversity". In this context, we propose a new evaluation protocol that measures the divergence of a set of generated images from the training set with respect to the distribution of attribute strengths, as follows. Single-attribute Divergence (SaD) measures the divergence between the PDFs of a single attribute. Paired-attribute Divergence (PaD) measures the divergence between the joint PDFs of a pair of attributes. Together, they show which attributes the models struggle with. To measure the attribute strengths of an image, we propose Heterogeneous CLIPScore (HCS), which measures the cosine similarity between image and text vectors with heterogeneous initial points. With SaD and PaD, we reveal the following about existing generative models. ProjectedGAN generates implausible attribute relationships, such as a baby with a beard, even though it has competitive scores on existing metrics. Diffusion models struggle to capture the diverse colors in the datasets. With more sampling timesteps, the latent diffusion model generates more minor objects, including earrings and necklaces. Stable Diffusion v1.5 captures the attributes better than v2.1. Our metrics lay a foundation for explainable evaluations of generative models.
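The protocol described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `hcs`, `kl_divergence`, `sad`, and `pad` are hypothetical; the "heterogeneous initial points" are assumed here to be the mean image embedding and the mean text embedding; the divergence is assumed to be KL from the training distribution to the generated one; and the PDFs are estimated with plain histograms, which may differ from the density estimator used in the paper. Random vectors stand in for real CLIP embeddings.

```python
import itertools
import numpy as np


def hcs(image_vecs, text_vecs, image_center, text_center):
    """Cosine similarity after shifting each modality by its own centre.

    Assumption: the centres are the mean image and mean text embeddings.
    Returns an array of shape (n_images, n_attributes).
    """
    img = image_vecs - image_center
    txt = text_vecs - text_center
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return np.clip(img @ txt.T, -1.0, 1.0)


def kl_divergence(p_samples, q_samples, bins=10, eps=1e-8):
    """KL(P || Q) from histogram estimates of samples with shape (N, D)."""
    dims = p_samples.shape[1]
    value_range = [(-1.0, 1.0)] * dims  # cosine similarities live in [-1, 1]
    p_hist, _ = np.histogramdd(p_samples, bins=bins, range=value_range)
    q_hist, _ = np.histogramdd(q_samples, bins=bins, range=value_range)
    # Smooth empty bins so the logarithm stays finite, then normalise.
    p = p_hist.ravel() + eps
    q = q_hist.ravel() + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))


def sad(train_scores, gen_scores, bins=10):
    """Mean single-attribute divergence over all attributes (columns)."""
    n_attr = train_scores.shape[1]
    return float(np.mean([
        kl_divergence(train_scores[:, [i]], gen_scores[:, [i]], bins)
        for i in range(n_attr)
    ]))


def pad(train_scores, gen_scores, bins=10):
    """Mean paired-attribute divergence over all attribute pairs."""
    pairs = itertools.combinations(range(train_scores.shape[1]), 2)
    return float(np.mean([
        kl_divergence(train_scores[:, [i, j]], gen_scores[:, [i, j]], bins)
        for i, j in pairs
    ]))


# Toy usage: random vectors stand in for CLIP image and text embeddings.
rng = np.random.default_rng(0)
train_img = rng.normal(size=(500, 16))
gen_img = rng.normal(size=(500, 16)) + 0.5  # hypothetical, shifted "generator"
text = rng.normal(size=(4, 16))             # four hypothetical attributes

image_center = train_img.mean(axis=0)
text_center = text.mean(axis=0)
train_scores = hcs(train_img, text, image_center, text_center)
gen_scores = hcs(gen_img, text, image_center, text_center)

print("SaD:", sad(train_scores, gen_scores))
print("PaD:", pad(train_scores, gen_scores))
```

Because both scores are averages of per-attribute (or per-pair) terms, the individual terms can be inspected before averaging to see which attributes or attribute pairs contribute most to the divergence, which is the interpretability the abstract refers to.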

Code Implementations: 1 repo
