CV · Dec 13, 2024

Learning Complex Non-Rigid Image Edits from Multimodal Conditioning

arXiv:2412.10219v1 · h-index: 11
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of realistic human image editing for applications such as content creation. The advance is incremental, as the method builds on Stable Diffusion.

The paper tackles the problem of inserting a person from a reference image into a novel scene so that the result looks natural and can be controlled via text and pose. It achieves this by training on a novel dataset of image pairs with automatically generated pose-difference captions. This training setup improves identity preservation and person-object interactions in complex scenes.

In this paper, we focus on inserting a given human (specifically, a single image of a person) into a novel scene. Our method, which builds on top of Stable Diffusion, yields natural-looking images while being highly controllable with text and pose. To accomplish this, we need to train on pairs of images: the first a reference image with the person, the second a "target image" showing the same person (with a different pose and possibly in a different background). Additionally, we require a text caption describing the new pose relative to that in the reference image. We present a novel dataset meeting these criteria, which we create using pairs of frames from human-centric and action-rich videos and employing a multimodal LLM to automatically summarize the difference in human pose for the text captions. We demonstrate that identity preservation is a more challenging task in scenes "in-the-wild", and especially in scenes where there is an interaction between persons and objects. Combining the weak supervision from noisy captions with robust 2D pose improves the quality of person-object interactions.
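The data layout described in the abstract (a reference frame, a target frame of the same person, and a caption describing the pose change) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the names `EditSample`, `sample_frame_pairs`, and `build_samples`, the `min_gap` heuristic for choosing frames far enough apart to show a pose change, and the `caption_fn` callback that stands in for the multimodal LLM.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EditSample:
    """One hypothetical training example (names are illustrative)."""
    reference_frame: int  # index of the frame used as the reference image
    target_frame: int     # index of a later frame showing the same person
    caption: str          # pose-difference text; a stand-in for LLM output


def sample_frame_pairs(num_frames: int, min_gap: int, num_pairs: int,
                       seed: int = 0) -> List[Tuple[int, int]]:
    """Pick (reference, target) indices at least `min_gap` frames apart.

    `min_gap` is an assumed heuristic: frames that are further apart are
    more likely to differ in pose. Returns an empty list if the video is
    too short to satisfy the gap.
    """
    if min_gap < 1 or num_frames <= min_gap:
        return []
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        ref = rng.randrange(0, num_frames - min_gap)
        tgt = rng.randrange(ref + min_gap, num_frames)
        pairs.append((ref, tgt))
    return pairs


def build_samples(pairs: List[Tuple[int, int]],
                  caption_fn: Callable[[int, int], str]) -> List[EditSample]:
    """Attach a caption to each pair; `caption_fn` is a placeholder for
    whatever model would describe the pose difference."""
    return [EditSample(ref, tgt, caption_fn(ref, tgt)) for ref, tgt in pairs]
```

In use, `sample_frame_pairs(100, 10, 5)` would give five index pairs from a 100-frame clip, and `build_samples` would pair each with a caption from the supplied callback. A real pipeline would also need person tracking and 2D pose extraction, which this sketch leaves out.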

