CV | Mar 21, 2024

Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization

arXiv:2403.14155v1 | 6 citations | h-index: 15 | AAAI
Originality: Incremental advance
AI Analysis

The paper addresses a specific bottleneck in zero-shot text-to-image customization for users who need subject-consistent image generation. It is an incremental improvement over existing methods.

The paper tackles zero-shot text-to-image customization, where existing methods tend to reproduce the pose of the input image and degrade the subject's identity. It proposes an orthogonal visual embedding and a self-attention swap to harmonize the visual and textual embeddings, yielding flexible generation that preserves the subject's identity.

In a surge of text-to-image (T2I) models and their customization methods that generate new images of a user-provided subject, current works focus on alleviating the costs incurred by a lengthy per-subject optimization. These zero-shot customization methods encode the image of a specified subject into a visual embedding which is then utilized alongside the textual embedding for diffusion guidance. The visual embedding incorporates intrinsic information about the subject, while the textual embedding provides a new, transient context. However, the existing methods often 1) are significantly affected by the input images, e.g., generating images with the same pose, and 2) exhibit deterioration in the subject's identity. We first pin down the problem and show that redundant pose information in the visual embedding interferes with the textual embedding containing the desired pose information. To address this issue, we propose orthogonal visual embedding which effectively harmonizes with the given textual embedding. We also adopt the visual-only embedding and inject the subject's clear features utilizing a self-attention swap. Our results demonstrate the effectiveness and robustness of our method, which offers highly flexible zero-shot generation while effectively maintaining the subject's identity.
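
The abstract does not spell out how the orthogonal visual embedding is computed. One plausible reading, offered here only as an illustrative sketch and not as the authors' implementation, is a Gram-Schmidt style step that removes from the visual embedding its component along the textual embedding, so the two no longer compete along that direction. The function names `dot` and `orthogonalize` and the toy three-dimensional vectors are assumptions made for this example.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(visual: Sequence[float], textual: Sequence[float]) -> List[float]:
    """Return `visual` minus its projection onto `textual`.

    Hypothetical helper: a single Gram-Schmidt step, assumed here as one
    way to make a visual embedding orthogonal to a textual embedding.
    """
    denom = dot(textual, textual)
    if denom == 0.0:
        # A zero textual vector defines no direction; leave visual unchanged.
        return list(visual)
    scale = dot(visual, textual) / denom
    return [v - scale * t for v, t in zip(visual, textual)]


# Toy vectors (assumed values, not real model embeddings).
visual = [3.0, 4.0, 0.0]
textual = [1.0, 0.0, 0.0]
v_orth = orthogonalize(visual, textual)

# The result has no component left along the textual direction.
print(v_orth, dot(v_orth, textual))
```

In a real pipeline the embeddings would be token-level tensors from the image and text encoders, and the projection might be applied per token or per subspace; those details are not stated in the abstract.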

