Prompt-Propose-Verify: A Reliable Hand-Object-Interaction Data Generation Framework using Foundational Models
This work addresses inaccurate hand generation in AI image synthesis for applications that require realistic human-object interactions. The contribution is incremental: it builds on existing diffusion models, adding a new dataset and framework.
The paper tackles the problem of diffusion models generating inaccurate human features, such as hands, in hand-object-interaction images. The authors create a well-annotated synthetic dataset using a Prompt-Propose-Verify framework and fine-tune a Stable Diffusion model on it, reporting considerably better performance than state-of-the-art benchmarks on metrics such as CLIPScore and ImageReward.
Diffusion models, when conditioned on text prompts, generate realistic-looking images with intricate details. However, most of these pre-trained models fail to generate accurate images of human features such as hands and teeth. We hypothesize that this limitation of diffusion models can be overcome with well-annotated, good-quality data. In this paper, we look specifically at improving hand-object-interaction image generation using diffusion models. We collect a well-annotated synthetic hand-object-interaction dataset curated using a Prompt-Propose-Verify framework and fine-tune a Stable Diffusion model on it. We evaluate the image-text dataset on qualitative and quantitative metrics such as CLIPScore, ImageReward, Fidelity, and alignment, and show considerably better performance than the current state-of-the-art benchmarks.
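For readers unfamiliar with CLIPScore, one of the metrics named above, the following is a minimal sketch of how the score is commonly defined: a rescaled cosine similarity between a CLIP image embedding and a CLIP text embedding, clipped at zero. The weight of 2.5 follows the original CLIPScore definition; the function names and the assumption that embeddings are already computed and passed in as plain lists of floats are illustrative only. The paper's exact evaluation setup is not described here, so this should not be read as the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def clip_score(image_embedding, text_embedding, weight=2.5):
    """CLIPScore sketch: weight * max(cos(image, text), 0).

    Assumes the embeddings were produced beforehand by a CLIP model
    (hypothetical inputs here; no model is loaded in this sketch).
    """
    return weight * max(cosine_similarity(image_embedding, text_embedding), 0.0)
```

A higher score indicates that the generated image and its prompt are closer in CLIP's embedding space. Clipping at zero means an image that is anti-correlated with its prompt scores no lower than one that is merely unrelated.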