CV · GR · LG · Apr 18, 2022

Simultaneous Multiple-Prompt Guided Generation Using Differentiable Optimal Transport

Apple
arXiv:2204.08472v1 · 1 citation · h-index: 51
Originality: Incremental advance
AI Analysis

The paper addresses a specific technical bottleneck in computational creativity tools for artists, though it appears to be an incremental improvement over existing text-to-image synthesis methods.

The paper tackles mode collapse in text-to-image synthesis: when the latent vector is optimized to reduce the mean distance between image patches and text prompts, the generated image converges to an average of all prompts and loses their diversity. The authors propose optimal transport (OT) matching techniques instead, which yield images that better reflect diverse prompts, with improvements reported both qualitatively and quantitatively.

Recent advances in deep learning, such as powerful generative models and joint text-image embeddings, have provided the computational creativity community with new tools, opening new perspectives for artistic pursuits. Text-to-image synthesis approaches that operate by generating images from text cues provide a case in point. These images are generated with a latent vector that is progressively refined to agree with text cues. To do so, patches are sampled within the generated image and compared with the text prompts in the common text-image embedding space; the latent vector is then updated, using gradient descent, to reduce the mean (average) distance between these patches and text cues. While this approach provides artists with ample freedom to customize the overall appearance of images, through their choice of generative models, the reliance on a simple criterion (mean of distances) often causes mode collapse: the entire image is drawn to the average of all text cues, thereby losing their diversity. To address this issue, we propose using matching techniques found in the optimal transport (OT) literature, resulting in images that are able to reflect faithfully a wide diversity of prompts. We provide numerous illustrations showing that OT avoids some of the pitfalls arising from estimating vectors with mean distances, and demonstrate the capacity of our proposed method to perform better in experiments, qualitatively and quantitatively.
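
The contrast between the two criteria described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the cosine cost, the uniform marginals, the entropic (Sinkhorn) solver, the regularization strength `epsilon`, and all function names are assumptions chosen for illustration, and the toy 2-D vectors stand in for real text-image embeddings. The sketch only evaluates the two losses; it does not differentiate through them or update a latent vector.

```python
import numpy as np

def cosine_cost(patches, prompts):
    """Pairwise cosine distance, shape (n_patches, n_prompts)."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    t = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    return 1.0 - p @ t.T

def mean_distance_loss(cost):
    """Baseline criterion: average distance over all patch/prompt pairs."""
    return cost.mean()

def sinkhorn_loss(cost, epsilon=0.05, n_iters=200):
    """Entropy-regularized OT cost between uniform weights on patches and prompts.

    Assumption: uniform marginals and a plain Sinkhorn loop; the paper's
    exact OT formulation is not given in the abstract.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # mass carried by each patch
    b = np.full(m, 1.0 / m)          # mass required by each prompt
    K = np.exp(-cost / epsilon)      # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):         # alternate marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    plan = u[:, None] * K * v[None, :]
    return (plan * cost).sum(), plan

# Two orthogonal "prompt" embeddings (toy 2-D stand-ins).
prompts = np.array([[1.0, 0.0], [0.0, 1.0]])

# Collapsed image: every patch sits at the average of the prompts.
collapsed = np.tile([[1.0, 1.0]], (4, 1))
# Diverse image: half the patches match each prompt.
diverse = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

c_collapsed = cosine_cost(collapsed, prompts)
c_diverse = cosine_cost(diverse, prompts)

mean_collapsed = mean_distance_loss(c_collapsed)
mean_diverse = mean_distance_loss(c_diverse)
ot_collapsed, _ = sinkhorn_loss(c_collapsed)
ot_diverse, plan_diverse = sinkhorn_loss(c_diverse)

print("mean loss:", mean_collapsed, "(collapsed) vs", mean_diverse, "(diverse)")
print("OT loss:  ", ot_collapsed, "(collapsed) vs", ot_diverse, "(diverse)")
```

In this toy setting the mean criterion scores the collapsed image lower than the diverse one, while the OT criterion ranks them the other way around, because the transport plan assigns each patch to the prompt it matches and requires every prompt to receive its share of mass.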

