SSP: A Simple and Safe automatic Prompt engineering method towards realistic image synthesis on LVM
This work addresses safety and quality issues in image generation for users of large vision models, but its contribution is incremental because it builds on existing prompt engineering techniques.
The paper tackles the problem of unsafe factors in text-to-image synthesis prompts by proposing SSP, a simple and safe prompt engineering method that appends optimal camera descriptions, resulting in an average 16% improvement in semantic consistency and a 48.9% improvement in safety metrics.
Recently, text-to-image (T2I) synthesis has advanced significantly, particularly with the emergence of Large Language Models (LLMs) and their use in enhancing Large Vision Models (LVMs), which has greatly improved the instruction-following capabilities of traditional T2I models. Nevertheless, previous methods focus on improving generation quality but introduce unsafe factors into prompts. We find that appending specific camera descriptions to prompts can enhance safety performance. Consequently, we propose a simple and safe prompt engineering method (SSP) that improves image generation quality by providing optimal camera descriptions. Specifically, we build a dataset of original prompts drawn from multiple datasets. To select the optimal camera, we design an optimal camera matching approach and implement a classifier that automatically matches each original prompt to a camera description. Appending the matched camera description to the original prompt yields an optimized prompt for LVM image generation. Experiments demonstrate that SSP improves semantic consistency by an average of 16% compared to other methods and safety metrics by 48.9%.
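The classify-then-append flow described in the abstract can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the camera descriptions, the category labels, and the keyword lookup standing in for the trained classifier are hypothetical and are not taken from the paper.

```python
# Hypothetical camera descriptions; the paper's actual candidate set is not reproduced here.
CAMERA_DESCRIPTIONS = {
    "portrait": "shot on a full-frame camera with an 85mm lens, shallow depth of field",
    "landscape": "shot on a full-frame camera with a 24mm wide-angle lens, deep focus",
    "default": "shot on a digital camera with a 50mm lens",
}

# Hypothetical keyword lists standing in for the trained prompt classifier.
KEYWORDS = {
    "portrait": ("person", "woman", "man", "face", "child"),
    "landscape": ("mountain", "forest", "beach", "city", "lake"),
}


def classify_prompt(prompt: str) -> str:
    """Return a camera category for the prompt (keyword stand-in for a classifier)."""
    words = set(prompt.lower().replace(",", " ").split())
    for label, keys in KEYWORDS.items():
        if words & set(keys):
            return label
    return "default"


def optimize_prompt(prompt: str) -> str:
    """Append the matched camera description to the original prompt."""
    label = classify_prompt(prompt)
    # Strip trailing punctuation so the appended clause reads as one prompt.
    return f"{prompt.rstrip(' .,')}, {CAMERA_DESCRIPTIONS[label]}"


if __name__ == "__main__":
    print(optimize_prompt("A quiet lake at dawn."))
```

The optimized prompt contains only the original text plus a fixed camera clause, so no free-form generated wording enters the prompt. A real system would replace `classify_prompt` with a learned model and pass the result to an LVM.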