CV · Apr 14

3DRealHead: Few-Shot Detailed Head Avatar

Jalees Nehvi, Timo Bolkart, Thabo Beeler, Justus Thies

arXiv:2604.13171 · 70.5 · h-index: 34

Predicted impact top 42% in CV · last 90 days · Originality: Incremental advance

AI Analysis

For immersive applications requiring realistic digital avatars, this method improves expressivity and identity fidelity with minimal input, though it is incremental over existing 3D head avatar approaches.

3DRealHead introduces a few-shot method for reconstructing detailed 3D head avatars from a handful of images. It uses a novel expression control signal extracted from monocular video to capture person-specific features such as the mouth and teeth, achieving higher expressivity than methods driven purely by a 3DMM.

The human face is central to communication. For immersive applications, the digital presence of a person should mirror the physical reality, capturing the user's idiosyncrasies and detailed facial expressions. However, current 3D head avatar methods often struggle to faithfully reproduce identity and facial expressions, even when multi-view data or learned priors are available. Learning priors that capture the diversity of human appearance is challenging because the underlying training data is limited, especially for regions with highly person-specific features such as the mouth and teeth. In addition, many avatar methods rely purely on 3D morphable model (3DMM)-based expression control, which strongly limits expressivity. To address these challenges, we introduce 3DRealHead, a few-shot head avatar reconstruction method with a novel expression control signal that is extracted from a monocular video stream of the subject. Specifically, the subject can take a few pictures of themselves, recover a 3D head avatar, and drive it with a consumer-level webcam. The avatar reconstruction is enabled by a novel few-shot inversion process of a 3D human head prior, which is represented as a Style U-Net that emits 3D Gaussian primitives that can be rendered under novel views. The prior is learned on the NeRSemble dataset. For animating the avatar, the U-Net is conditioned on 3DMM-based facial expression signals as well as features of the mouth region extracted from the driving video. These additional mouth features allow us to recover facial expressions that cannot be represented by the 3DMM, leading to higher expressivity and a closer resemblance to the physical reality.
