MIFO: Learning and Synthesizing Multi-Instance from One Image
The work addresses a challenge in image synthesis for applications that require precise control over multiple instances learned from limited data, though the method itself appears incremental.
The paper tackles the problem of learning and synthesizing multi-instance semantics from a single image, achieving disentangled and high-quality results with robust performance on similar or rare objects.
This paper proposes a method for precisely learning and synthesizing multi-instance semantics from a single image. The difficulty of this problem lies in the limited training data, and it becomes even more challenging when the instances to be learned have similar semantics or appearance. To address this, we propose a penalty-based attention optimization that disentangles similar semantics during the learning stage. Then, during synthesis, we introduce and optimize box control in the attention layers to further mitigate semantic leakage while precisely controlling the output layout. Experimental results demonstrate that our method achieves disentangled, high-quality semantic learning and synthesis, striking a balance between editability and instance consistency. Our method remains robust when dealing with semantically or visually similar instances or rarely seen objects. The code is publicly available at https://github.com/Kareneveve/MIFO.
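The abstract does not give the form of the attention penalty or the box control, so the following is only a minimal illustrative sketch of the two ideas, not the authors' implementation. It assumes each instance token has a 2D cross-attention map, measures how much two maps overlap (a possible penalty for entangled semantics), and measures how much attention mass falls outside a target box (a possible layout term). The function names, the minimum-based overlap measure, and the `(x0, y0, x1, y1)` box format are all hypothetical choices made for this example.

```python
import numpy as np

def overlap_penalty(attn_a, attn_b):
    """Hypothetical penalty: shared mass of two normalized attention maps.

    Returns 0.0 when the maps are disjoint and 1.0 when they are identical.
    """
    a = attn_a / attn_a.sum()
    b = attn_b / attn_b.sum()
    return float(np.minimum(a, b).sum())

def box_mask(shape, box):
    """Binary mask that is 1 inside box = (x0, y0, x1, y1), end-exclusive."""
    h, w = shape
    x0, y0, x1, y1 = box
    mask = np.zeros((h, w), dtype=float)
    mask[y0:y1, x0:x1] = 1.0
    return mask

def outside_box_loss(attn, box):
    """Hypothetical layout term: fraction of attention mass outside the box."""
    a = attn / attn.sum()
    return float(1.0 - (a * box_mask(attn.shape, box)).sum())

# Toy 4x4 attention maps for two instances in opposite corners.
attn_a = np.zeros((4, 4))
attn_a[0:2, 0:2] = 1.0
attn_b = np.zeros((4, 4))
attn_b[2:4, 2:4] = 1.0

leak = overlap_penalty(attn_a, attn_b)           # disjoint maps
inside = outside_box_loss(attn_a, (0, 0, 2, 2))  # box matches the map
outside = outside_box_loss(attn_a, (2, 2, 4, 4)) # box misses the map
```

In an actual diffusion pipeline, terms like these would be computed on differentiable attention tensors and minimized by gradient descent; NumPy is used here only to keep the arithmetic easy to inspect.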