ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet
This work addresses mode collapse and improves versatility in human image generation tasks such as pose re-targeting and virtual try-on, representing a novel method for a known bottleneck rather than a foundational advancement.
The paper tackles concurrent spatial and visual conditioning for image generation by introducing ViscoNet, a lightweight one-branch-adapter architecture that requires significantly fewer trainable parameters and a much smaller training dataset than the state-of-the-art IP-Adapter, while preserving generative power and addressing mode collapse.
This paper introduces ViscoNet, a novel one-branch-adapter architecture for concurrent spatial and visual conditioning. Our lightweight model requires a trainable parameter count and a dataset size that are multiple orders of magnitude smaller than those of the current state-of-the-art IP-Adapter. Even so, our method successfully preserves the generative power of the frozen text-to-image (T2I) backbone. Notably, it excels in addressing mode collapse, a pervasive issue previously overlooked. Our novel architecture demonstrates outstanding capabilities in achieving a harmonious visual-text balance, unlocking unparalleled versatility in various human image generation tasks, including pose re-targeting, virtual try-on, stylization, person re-identification, and textile transfer. Demo and code are available from the project page: https://soon-yau.github.io/visconet/