AI CV LG Dec 9, 2024

Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty

Meera Hahn, Wenjun Zeng, Nithish Kannen, Rich Galt, Kartikeya Badola, Been Kim, Zi Wang

arXiv:2412.06771v3 20.019 citations h-index: 9 Has Code ICML

Originality: Incremental advance

AI Analysis

This work addresses the user burden of refining prompts in generative AI, though it appears incremental because it builds on existing T2I models and adds an interactive interface.

The paper tackles the problem of misalignment between user intent and model understanding in text-to-image generation by proposing proactive agents that ask clarification questions and present uncertainty as editable belief graphs, achieving at least 2 times higher VQAScore than standard methods and 90% human approval for workflow helpfulness.

User prompts for generative AI models are often underspecified, leading to a misalignment between the user intent and models' understanding. As a result, users commonly have to painstakingly refine their prompts. We study this alignment problem in text-to-image (T2I) generation and propose a prototype for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their uncertainty about user intent as an understandable and editable belief graph. We build simple prototypes for such agents and propose a new scalable and automated evaluation approach using two agents, one with a ground truth intent (an image) while the other tries to ask as few questions as possible to align with the ground truth. We experiment over three image-text datasets: ImageInWords (Garg et al., 2024), COCO (Lin et al., 2014) and DesignBench, a benchmark we curated with strong artistic and design elements. Experiments over the three datasets demonstrate the proposed T2I agents' ability to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than the standard T2I generation. Moreover, we conducted human studies and observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow, highlighting the effectiveness of our approach. Code and DesignBench can be found at https://github.com/google-deepmind/proactive_t2i_agents.
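The two-agent evaluation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `BeliefGraph`, `SimulatedUser`, `run_dialogue`, and `alignment` names are hypothetical, the belief state is reduced to a dictionary of candidate attribute values, the ground-truth intent is a dictionary rather than an image, and the simple match fraction stands in for VQAScore, which is not computed here. The question-selection rule (ask about the attribute with the most remaining candidates) is likewise an assumption chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class BeliefGraph:
    """Toy belief state: candidate values still considered possible per attribute."""
    candidates: dict[str, list[str]]

    def uncertain_attributes(self) -> list[str]:
        # An attribute is uncertain while more than one candidate value remains.
        return [name for name, values in self.candidates.items() if len(values) > 1]

    def update(self, attribute: str, value: str) -> None:
        # Collapse the attribute to the value the user confirmed.
        self.candidates[attribute] = [value]


class SimulatedUser:
    """Hypothetical stand-in for the agent that holds the ground-truth intent."""

    def __init__(self, ground_truth: dict[str, str]):
        self.ground_truth = ground_truth

    def answer(self, attribute: str) -> str:
        return self.ground_truth[attribute]


def run_dialogue(belief: BeliefGraph, user: SimulatedUser, max_questions: int) -> int:
    """Ask clarification questions until nothing is uncertain or the budget runs out."""
    questions_asked = 0
    while questions_asked < max_questions:
        uncertain = belief.uncertain_attributes()
        if not uncertain:
            break
        # Assumed heuristic: ask about the attribute with the most candidates left.
        attribute = max(uncertain, key=lambda name: len(belief.candidates[name]))
        belief.update(attribute, user.answer(attribute))
        questions_asked += 1
    return questions_asked


def alignment(belief: BeliefGraph, ground_truth: dict[str, str]) -> float:
    """Fraction of attributes resolved to the ground-truth value (a VQAScore stand-in)."""
    resolved = sum(
        1 for name, value in ground_truth.items()
        if belief.candidates.get(name) == [value]
    )
    return resolved / len(ground_truth)


if __name__ == "__main__":
    truth = {"subject": "dog", "color": "blue", "style": "watercolor"}
    belief = BeliefGraph({
        "subject": ["dog"],
        "color": ["red", "blue", "green"],
        "style": ["photo", "watercolor"],
    })
    asked = run_dialogue(belief, SimulatedUser(truth), max_questions=5)
    print(asked, alignment(belief, truth))
```

In this sketch the number of questions asked and the final alignment are the two quantities traded off, mirroring the abstract's goal of aligning with the ground truth while asking as few questions as possible.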
