78.3CVMay 9Code
Egocentric Whole-Body Human Mesh Recovery with Prior-Guided LearningSoyeon Na, Seung Young Noh, Ju Yong Chang
Egocentric human mesh recovery (HMR) from monocular head-mounted cameras is increasingly important for AR/VR applications, but remains challenging due to the lack of reliable ground-truth (GT) annotations based on parametric human body models such as SMPL and SMPL-X for real egocentric images. Existing egocentric HMR methods typically rely on pseudo-GT and focus on body pose estimation, which limits their ability to recover fine-grained whole-body details such as hands and face. We study egocentric whole-body human mesh recovery and propose a prior-guided learning framework that reconstructs whole-body meshes from a single egocentric image. We construct more accurate optimization-based pseudo-GT aligned with 3D joint supervision, and leverage multiple priors by adapting an exocentric HMR foundation model together with a diffusion-based pose prior. A deterministic undistortion module is further adopted to handle fisheye distortions in egocentric images. Experiments across multiple egocentric benchmarks demonstrate improved whole-body reconstruction compared to state-of-the-art methods, and show that our optimization-based pseudo-GT is substantially more accurate than existing regression-based pseudo-GT. To facilitate reproducibility, the code and dataset annotations are publicly available at https://github.com/naso06/EgoSMPLX.
11.3CVMar 16
PHAC: Promptable Human Amodal CompletionSeung Young Noh, Ju Yong Chang
Conditional image generation methods are increasingly used in human-centric applications, yet existing human amodal completion (HAC) models offer users limited control over the completed content. Given an occluded person image, they hallucinate invisible regions while preserving visible ones, but cannot reliably incorporate user-specified constraints such as a desired pose or spatial extent. As a result, users often resort to repeatedly sampling the model until they obtain a satisfactory output. Pose-guided person image synthesis (PGPIS) methods allow explicit pose conditioning, but frequently fail to preserve the instance-specific visible appearance and tend to be biased toward the training distribution, even when built on strong diffusion model priors. To address these limitations, we introduce promptable human amodal completion (PHAC), a new task that completes occluded human images while satisfying both visible appearance constraints and multiple user prompts. Users provide simple point-based prompts, such as additional joints for the target pose or bounding boxes for desired regions; these prompts are encoded using ControlNet modules specialized for each prompt type. These modules inject the prompt signals into a pre-trained diffusion model, and we fine-tune only the cross-attention blocks to obtain strong prompt alignment without degrading the underlying generative prior. To further preserve visible content, we propose an inpainting-based refinement module that starts from a slightly noised coarse completion, faithfully preserves the visible regions, and ensures seamless blending at occlusion boundaries. Extensive experiments on the HAC and PGPIS benchmarks show that our approach yields more physically plausible and higher-quality completions, while significantly improving prompt alignment compared with existing amodal completion and pose-guided synthesis methods.
CVAug 18, 2025
Stable Diffusion-Based Approach for Human De-OcclusionSeung Young Noh, Ju Yong Chang
Humans can infer the missing parts of an occluded object by leveraging prior knowledge and visible cues. However, enabling deep learning models to accurately predict such occluded regions remains a challenging task. De-occlusion addresses this problem by reconstructing both the mask and RGB appearance. In this work, we focus on human de-occlusion, specifically targeting the recovery of occluded body structures and appearances. Our approach decomposes the task into two stages: mask completion and RGB completion. The first stage leverages a diffusion-based human body prior to provide a comprehensive representation of body structure, combined with occluded joint heatmaps that offer explicit spatial cues about missing regions. The reconstructed amodal mask then serves as a conditioning input for the second stage, guiding the model on which areas require RGB reconstruction. To further enhance RGB generation, we incorporate human-specific textual features derived using a visual question answering (VQA) model and encoded via a CLIP encoder. RGB completion is performed using Stable Diffusion, with decoder fine-tuning applied to mitigate pixel-level degradation in visible regions -- a known limitation of prior diffusion-based de-occlusion methods caused by latent space transformations. Our method effectively reconstructs human appearances even under severe occlusions and consistently outperforms existing methods in both mask and RGB completion. Moreover, the de-occluded images generated by our approach can improve the performance of downstream human-centric tasks, such as 2D pose estimation and 3D human reconstruction. The code will be made publicly available.