A Comprehensive Survey on Visual Concept Mining in Text-to-image Diffusion Models
This is an incremental survey that gives researchers in AI and computer vision a systematic classification of techniques for improving model controllability.
The paper addresses the limited controllability of text-to-image diffusion models, which stems from the constraints of textual signals. It surveys visual concept mining techniques that use reference images to improve concept capture, categorizes existing research into four areas, and identifies future directions.
Text-to-image diffusion models have made significant advances in generating high-quality, diverse images from text prompts. However, the inherent limitations of textual signals often prevent these models from fully capturing specific concepts, which reduces their controllability. To address this issue, several approaches have incorporated personalization techniques that use reference images to mine visual concept representations; these representations complement the textual input and enhance the controllability of text-to-image diffusion models. Despite these advances, Visual Concept Mining (VCM) has not yet been explored comprehensively and systematically. In this paper, we categorize existing research into four key areas: Concept Learning, Concept Erasing, Concept Decomposition, and Concept Combination. This classification provides insight into the foundational principles of VCM techniques. Additionally, we identify key challenges and propose future research directions for the field.