ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
This work addresses the need for efficient and accurate personalized image generation, though the contribution is incremental in that it builds on existing customization approaches.
The paper tackles the problem of customizing text-to-image generation with user-defined concepts by proposing a learning-based encoder that maps image features into textual embeddings, achieving high-fidelity inversion and robust editability with significantly faster encoding compared to optimization-based methods.
In addition to their unprecedented ability in imaginative creation, large text-to-image models are expected to take customized concepts in image generation. Existing works generally learn such concepts in an optimization-based manner, which imposes an excessive computation or memory burden. In this paper, we instead propose a learning-based encoder, which consists of global and local mapping networks, for fast and accurate customized text-to-image generation. Specifically, the global mapping network projects the hierarchical features of a given image into multiple new words in the textual word embedding space, i.e., one primary word for the well-editable concept and other auxiliary words to exclude irrelevant disturbances (e.g., background). Meanwhile, a local mapping network injects the encoded patch features into cross-attention layers to provide omitted details, without sacrificing the editability of the primary concept. We compare our method with existing optimization-based approaches on a variety of user-defined concepts, and demonstrate that our method enables high-fidelity inversion and more robust editability with a significantly faster encoding process. Our code is publicly available at https://github.com/csyxwei/ELITE.
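The two mapping networks described above can be illustrated with a small toy sketch in plain Python. It is not the authors' implementation: the function names (`global_mapping`, `cross_attention_with_local`), the use of a single linear map per layer, the choice of the first layer's word as the primary word, and the additive fusion weighted by a hypothetical `lam` parameter are all assumptions made for illustration. The abstract only states that hierarchical features become several new words and that patch features are injected into cross-attention layers.

```python
import math
import random


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def global_mapping(layer_features, layer_weights):
    """Map one pooled feature vector per encoder layer to one pseudo-word each.

    Assumption: the first word is treated as the primary (editable) concept
    and the remaining words as auxiliary ones.
    """
    words = [linear(w, f) for w, f in zip(layer_weights, layer_features)]
    return words[0], words[1:]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def cross_attention_with_local(query, text_tokens, patch_tokens, lam=1.0):
    """Text cross-attention plus a second attention over local patch tokens.

    Assumption: the two outputs are fused additively, weighted by `lam`.
    """
    text_out = attention(query, text_tokens, text_tokens)
    local_out = attention(query, patch_tokens, patch_tokens)
    return [t + lam * l for t, l in zip(text_out, local_out)]


rng = random.Random(0)
feat_dim, embed_dim, num_layers, num_patches = 8, 4, 3, 5

# Global branch: one feature vector per layer -> primary + auxiliary words.
layer_features = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(num_layers)]
layer_weights = [random_matrix(embed_dim, feat_dim, rng) for _ in range(num_layers)]
primary, auxiliary = global_mapping(layer_features, layer_weights)
prompt_tokens = [primary] + auxiliary  # stand-in for a prompt containing the new words

# Local branch: patch features projected into the same embedding space.
local_weights = random_matrix(embed_dim, feat_dim, rng)
patch_tokens = [
    linear(local_weights, [rng.gauss(0, 1) for _ in range(feat_dim)])
    for _ in range(num_patches)
]

query = [rng.gauss(0, 1) for _ in range(embed_dim)]
fused = cross_attention_with_local(query, prompt_tokens, patch_tokens, lam=0.5)
```

Setting `lam=0` in this sketch reduces the layer to ordinary text cross-attention, which is one way to see how a local branch of this kind could add detail without replacing the text-driven path.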