CV · Jan 15, 2025

SHYI: Action Support for Contrastive Learning in High-Fidelity Text-to-Image Generation

Tianxiang Xia, Lin Xiao, Yannick Montorfani, Francesco Pavia, Enis Simsar, Thomas Hofmann

arXiv:2501.09055v1 · 3.6 · h-index: 10 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses fidelity issues in text-to-image generation for complex actions. It is an incremental improvement over existing contrastive learning methods.

The paper tackles infidelity in text-to-image generation for actions involving multiple objects. It enhances contrastive learning with semantically hypergraphic contrastive adjacency learning and InteractDiffusion, and shows promising results even on verbs that Stable Diffusion handles only moderately well.

In this project, we address the issue of infidelity in text-to-image generation, particularly for actions involving multiple objects. To this end, we build on the CONFORM framework, which uses contrastive learning to improve the accuracy of images generated for prompts with multiple objects. However, the depiction of actions involving multiple different objects still leaves considerable room for improvement. To improve it, we employ semantically hypergraphic contrastive adjacency learning, which comprises an enhanced contrastive structure and a "contrast but link" technique. We further improve Stable Diffusion's understanding of actions with InteractDiffusion. For evaluation, we use CLIP image-text similarity and TIFA, and we additionally conducted a user study. Our method shows promising results even with verbs that Stable Diffusion understands only moderately well. Finally, we analyze the results to suggest future directions. Our codebase is available on polybox at: https://polybox.ethz.ch/index.php/s/dJm3SWyRohUrFxn

