CV | Feb 25, 2025

Realistic Clothed Human and Object Joint Reconstruction from a Single Image

Ayushi Dutta, Marco Pesavento, Marco Volino, Adrian Hilton, Armin Mustafa

arXiv:2502.18150v2 | 1 citation | h-index: 10 | CVMP

Originality: Highly original

AI Analysis

This paper addresses the challenge of capturing detailed 3D human-object interactions from monocular images. Its contribution is an incremental gain in realism over existing template-based methods.

The paper tackles the problem of jointly reconstructing 3D clothed humans and objects from a single RGB image. It reports higher-quality reconstructions with realistic details such as clothing, supported by extensive evaluation on synthetic and real-world datasets.

Recent approaches to jointly reconstructing 3D humans and objects from a single RGB image represent 3D shapes with template-based or coarse models, which fail to capture details of loose clothing on human bodies. In this paper, we introduce a novel implicit approach for jointly reconstructing realistic 3D clothed humans and objects from a monocular view. For the first time, we model both the human and the object with an implicit representation, which allows us to capture more realistic details such as clothing. This task is extremely challenging due to human-object occlusions and the lack of 3D information in 2D images, which often lead to poor detail reconstruction and depth ambiguity. To address these problems, we propose a novel attention-based neural implicit model that leverages image pixel alignment both from the input human-object image, for a global understanding of the human-object scene, and from local separate views of the human and object images, to improve realism with, for example, clothing details. Additionally, the network is conditioned on semantic features derived from an estimated human-object pose prior, which provides 3D spatial information about the shared space of humans and objects. To handle human occlusion caused by objects, we use a generative diffusion model that inpaints the occluded regions, recovering otherwise lost details. For training and evaluation, we introduce a synthetic dataset featuring rendered scenes of inter-occluded 3D human scans and diverse objects. Extensive evaluation on both synthetic and real-world datasets demonstrates the superior quality of the proposed human-object reconstructions over competitive methods.
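
The abstract describes the core idea only at a high level, so the following is a minimal sketch of what a pixel-aligned, attention-fused implicit occupancy query could look like, not the authors' implementation. Every name here (`bilinear_sample`, `project_orthographic`, `attention_fuse`, `occupancy`) is hypothetical, as are the orthographic camera, the single learned query vector, and the linear output layer. For simplicity the three feature maps share one projection, whereas the paper uses separate local views of the human and the object. The pose-prior conditioning and the diffusion-based inpainting are omitted.

```python
import math

def bilinear_sample(feat, x, y):
    """Sample feat[row][col] (a list of channels) at continuous pixel (x, y)."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the image bounds
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ax) * (1 - ay) * feat[y0][x0][k]
        + ax * (1 - ay) * feat[y0][x1][k]
        + (1 - ax) * ay * feat[y1][x0][k]
        + ax * ay * feat[y1][x1][k]
        for k in range(len(feat[0][0]))
    ]

def project_orthographic(point, width, height):
    """Assumed camera: x, y in [-1, 1] map linearly to pixel coordinates."""
    px = (point[0] + 1.0) * 0.5 * (width - 1)
    py = (1.0 - point[1]) * 0.5 * (height - 1)
    return px, py

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(query, features):
    """Scaled dot-product attention of one query over several feature vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(scores)
    fused = [
        sum(w * feat[k] for w, feat in zip(weights, features))
        for k in range(len(query))
    ]
    return fused, weights

def occupancy(point, global_map, human_map, object_map, query, w_out, bias):
    """Predict an occupancy probability for one 3D query point."""
    feats = []
    for fmap in (global_map, human_map, object_map):
        px, py = project_orthographic(point, len(fmap[0]), len(fmap))
        feats.append(bilinear_sample(fmap, px, py))
    fused, weights = attention_fuse(query, feats)
    # Append the depth so the decoder can separate points along the same ray.
    inputs = fused + [point[2]]
    logit = sum(w * v for w, v in zip(w_out, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-logit)), weights
```

In this sketch a surface would be extracted by evaluating `occupancy` on a dense grid of 3D points and thresholding at 0.5. With an all-zero query the attention weights are uniform across the three feature sources; a trained query would shift weight toward whichever view is most informative at a given point.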
