CV · AI · Dec 5, 2023

ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet

arXiv:2312.03154v2 · 9 citations · h-index: 4 · ECCV Workshops
AI Analysis

This work addresses mode collapse and improves versatility in human image generation tasks such as pose re-targeting and virtual try-on, representing a novel method for a known bottleneck rather than a foundational advancement.

The paper tackles concurrent spatial and visual conditioning for image generation by introducing ViscoNet, a lightweight one-branch-adapter architecture. It needs far fewer trainable parameters and a much smaller training dataset than the state-of-the-art IP-Adapter, while preserving the generative power of the backbone and addressing mode collapse.

This paper introduces ViscoNet, a novel one-branch-adapter architecture for concurrent spatial and visual conditioning. Our lightweight model requires trainable parameters and dataset size multiple orders of magnitude smaller than the current state-of-the-art IP-Adapter. However, our method successfully preserves the generative power of the frozen text-to-image (T2I) backbone. Notably, it excels in addressing mode collapse, a pervasive issue previously overlooked. Our novel architecture demonstrates outstanding capabilities in achieving a harmonious visual-text balance, unlocking unparalleled versatility in various human image generation tasks, including pose re-targeting, virtual try-on, stylization, person re-identification, and textile transfer. Demo and code are available from the project page: https://soon-yau.github.io/visconet/

Code Implementations: 1 repo