CV AI Jan 30

PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories

Gemma Canet Tarrés, Manel Baradad, Francesc Moreno-Noguer, Yumeng Li

Amazon

arXiv:2602.00267v12.81 citations h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses the challenge of high-fidelity multi-object compositing for professional design and content creation, representing an incremental improvement over current state-of-the-art methods.

The paper tackles the problem of multi-object compositing for studio-level applications, where existing generative models often fail to preserve object identity and layout fidelity. The proposed PLACID framework uses a pretrained video diffusion model with synthetic trajectories to achieve superior identity preservation and fewer omissions, as validated by quantitative evaluations and user studies.

Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short for studio-level multi-object compositing. This task simultaneously demands (i) near-perfect preservation of each item's identity, (ii) precise background and color fidelity, (iii) control over layout and design elements, and (iv) complete, appealing displays showcasing all objects. However, current state-of-the-art models often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentations. To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve object consistency, identity, and background details by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences in which randomly placed objects move smoothly to their target positions. This synthetic data aligns with the video model's temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image. Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with fewer omitted objects and visually appealing results.
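The synthetic-sequence idea in the abstract (objects start at random positions and move smoothly to their targets, with the final frame being the composite) can be illustrated with a minimal sketch. This is not the authors' data pipeline: the function names, the straight-line path, the smoothstep easing, the frame count, and the canvas size are all hypothetical choices made only to show the general shape of such a trajectory generator.

```python
import random


def smoothstep(t: float) -> float:
    """Ease-in/ease-out curve: 0 at t=0, 1 at t=1, zero slope at both ends."""
    return t * t * (3.0 - 2.0 * t)


def synthetic_trajectories(targets, num_frames=16, canvas=(512, 512), seed=0):
    """Return frames[f][i] = (x, y) position of object i at frame f.

    Assumption (not from the paper): each object starts at a uniformly
    random point on the canvas and follows a straight, eased path to its
    target position, arriving exactly on the last frame.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    rng = random.Random(seed)
    # One random start position per object.
    starts = [
        (rng.uniform(0, canvas[0]), rng.uniform(0, canvas[1]))
        for _ in targets
    ]
    frames = []
    for f in range(num_frames):
        # Interpolation weight goes from 0 (first frame) to 1 (last frame).
        w = smoothstep(f / (num_frames - 1))
        frames.append([
            (sx + w * (tx - sx), sy + w * (ty - sy))
            for (sx, sy), (tx, ty) in zip(starts, targets)
        ])
    return frames


if __name__ == "__main__":
    layout = [(100.0, 200.0), (300.0, 50.0)]  # hypothetical target layout
    frames = synthetic_trajectories(layout, num_frames=8)
    print("first frame:", frames[0])
    print("last frame: ", frames[-1])
```

In this sketch the last frame places every object at its target, mirroring the abstract's description of the final video frame serving as the composite; a real pipeline would render object images along these paths rather than only tracking coordinates.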
