CV · Feb 7, 2025

Multitwine: Multi-Object Compositing with Text and Layout Control

arXiv:2502.05165v1 · 8 citations · h-index: 9 · CVPR
Originality: Highly original
AI Analysis

This work addresses text-driven object compositing, a problem relevant to image and scene generation and of interest to researchers and developers in computer vision and graphics.

The authors tackle multi-object compositing with text and layout control, reporting state-of-the-art performance in both compositing and subject-driven generation. Their model can add multiple objects to a scene, capture various interactions, and autonomously generate supporting objects when needed.

We introduce the first generative model capable of simultaneous multi-object compositing, guided by both text and layout. Our model allows for the addition of multiple objects within a scene, capturing a range of interactions from simple positional relations (e.g., next to, in front of) to complex actions requiring reposing (e.g., hugging, playing guitar). When an interaction implies additional props, like "taking a selfie", our model autonomously generates these supporting objects. By jointly training for compositing and subject-driven generation, also known as customization, we achieve a more balanced integration of textual and visual inputs for text-driven object compositing. As a result, we obtain a versatile model with state-of-the-art performance in both tasks. We further present a data generation pipeline leveraging visual and language models to effortlessly synthesize multimodal, aligned training data.

