CVOct 21, 2024

Disambiguating Monocular Reconstruction of 3D Clothed Human with Spatial-Temporal Transformer

arXiv:2410.16337v1h-index: 2
Originality Incremental advance
AI Analysis

This work improves 3D human reconstruction for computer vision applications, but it is incremental as it builds on existing implicit function and parametric model approaches.

The paper tackles the problem of reconstructing 3D clothed humans from monocular camera data by addressing ambiguities in back details and local image regions, proposing a Spatial-Temporal Transformer network that outperforms state-of-the-art methods on datasets like Adobe and MonoPerfCap.

Reconstructing 3D clothed humans from monocular camera data is highly challenging due to viewpoint limitations and image ambiguity. While implicit function-based approaches, combined with prior knowledge from parametric models, have made significant progress, there are still two notable problems. Firstly, the back details of human models are ambiguous due to viewpoint invisibility. The quality of the back details depends on the back normal map predicted by a convolutional neural network (CNN). However, the CNN lacks global information awareness for comprehending the back texture, resulting in excessively smooth back details. Secondly, a single image suffers from local ambiguity due to lighting conditions and body movement. However, implicit functions are highly sensitive to pixel variations in ambiguous regions. To address these ambiguities, we propose the Spatial-Temporal Transformer (STT) network for 3D clothed human reconstruction. A spatial transformer is employed to extract global information for normal map prediction. The establishment of global correlations facilitates the network in comprehending the holistic texture and shape of the human body. Simultaneously, to compensate for local ambiguity in images, a temporal transformer is utilized to extract temporal features from adjacent frames. The incorporation of temporal features can enhance the accuracy of input features in implicit networks. Furthermore, to obtain more accurate temporal features, joint tokens are employed to establish local correspondences between frames. Experimental results on the Adobe and MonoPerfCap datasets have shown that our method outperforms state-of-the-art methods and maintains robust generalization even under low-light outdoor conditions.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes