CV · Dec 24, 2024

Scendi Score: Prompt-Aware Diversity Evaluation via Schur Complement of CLIP Embeddings

Azim Ospanov, Mohammad Jalali, Farzan Farnia

arXiv:2412.18645v3 · 12.19 citations · h-index: 15 · Has Code

Originality: Incremental advance

AI Analysis

The paper gives researchers and developers working on text-to-image and text-to-text generative models a new metric for assessing prompt-aware diversity. The advance is incremental, since it builds on existing CLIP-based methods.

The paper addresses the evaluation of diversity in images generated by text-to-image models, which existing metrics such as CLIPScore do not measure. It proposes the Scendi score, based on a Schur complement decomposition of CLIP embeddings, and reports numerical results indicating that the score captures intrinsic diversity.

The use of CLIP embeddings to assess the fidelity of samples produced by text-to-image generative models has been extensively explored in the literature. While the widely adopted CLIPScore, derived from the cosine similarity of text and image embeddings, effectively measures the alignment of a generated image, it does not quantify the diversity of images generated by a text-to-image model. In this work, we extend the application of CLIP embeddings to quantify and interpret the intrinsic diversity of text-to-image models, that is, their ability to generate diverse images from similar text prompts, which we refer to as prompt-aware diversity. To achieve this, we propose a decomposition of the CLIP-based kernel covariance matrix of image data into text-based and non-text-based components. We perform this decomposition using the Schur complement of the joint image-text kernel covariance matrix, and define the matrix-based entropy of the decomposed component as the Schur Complement ENtropy DIversity (Scendi) score, a measure of prompt-aware diversity for prompt-guided generative models. Additionally, we discuss the application of the Schur complement-based decomposition to nullify the influence of a given prompt on the CLIP embedding of an image, enabling focus or defocus of the embedded vectors on specific objects. We present several numerical results that apply our proposed Scendi score to evaluate text-to-image and LLM (text-to-text) models. Our numerical results indicate the success of the Scendi score in capturing the intrinsic diversity of prompt-guided generative models. The codebase is available at https://github.com/aziksh-ospanov/scendi-score.
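The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes a linear kernel, so the kernel covariance matrices reduce to uncentered covariances of the raw embedding vectors, adds a small ridge term for numerical stability, and uses synthetic random vectors in place of real CLIP embeddings. The function name `schur_complement_entropy` and the `ridge` parameter are hypothetical, and the function returns the entropy itself with no further transformation.

```python
import numpy as np


def schur_complement_entropy(img_emb, txt_emb, ridge=1e-6):
    """Entropy of the image covariance after removing its text-explained part.

    Assumptions (illustrative only): linear kernel, uncentered covariance,
    and a ridge term so the text covariance can be inverted safely.
    """
    n = img_emb.shape[0]
    c_ii = img_emb.T @ img_emb / n  # image-image covariance block
    c_it = img_emb.T @ txt_emb / n  # image-text cross-covariance block
    c_tt = txt_emb.T @ txt_emb / n  # text-text covariance block

    # Schur complement: C_II - C_IT (C_TT + ridge * I)^{-1} C_TI
    reg = c_tt + ridge * np.eye(c_tt.shape[0])
    schur = c_ii - c_it @ np.linalg.solve(reg, c_it.T)
    schur = (schur + schur.T) / 2  # symmetrize against round-off

    # Matrix-based (von Neumann) entropy of the trace-normalized matrix.
    eigvals = np.clip(np.linalg.eigvalsh(schur), 0.0, None)
    total = eigvals.sum()
    if total <= 0:
        return 0.0
    probs = eigvals / total
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())


# Synthetic stand-ins for embeddings: images = linear function of text + residual.
rng = np.random.default_rng(0)
n, d_txt, d_img = 2000, 4, 8
txt = rng.normal(size=(n, d_txt))
mixing = rng.normal(size=(d_txt, d_img))

# Residual variation spread over all image dimensions.
isotropic_residual = rng.normal(size=(n, d_img))
isotropic_score = schur_complement_entropy(txt @ mixing + isotropic_residual, txt)

# Residual variation confined to a single image dimension.
narrow_residual = np.zeros((n, d_img))
narrow_residual[:, 0] = rng.normal(size=n)
one_direction_score = schur_complement_entropy(txt @ mixing + narrow_residual, txt)
```

In this toy setup the text-explained part of the image covariance cancels in the Schur complement, so the entropy reflects only how the residual, non-text variation is spread across directions: it is high when the residual fills all image dimensions and close to zero when it lies along a single direction.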

