CV · Jun 11, 2024

FaceGPT: Self-supervised Learning to Chat about 3D Human Faces

arXiv:2406.07163v1 · 5 citations
Originality: Incremental advance
AI Analysis

FaceGPT addresses a limitation of specialized 3D face reconstruction methods, which lack semantic reasoning, by offering a self-supervised approach to human face analysis.

The paper tackles the problem of enabling Large Vision-Language Models to reason about 3D human faces from images and text, achieving high-quality 3D face reconstructions without expensive 3D annotations.

We introduce FaceGPT, a self-supervised learning framework for Large Vision-Language Models (VLMs) to reason about 3D human faces from images and text. Typical 3D face reconstruction methods are specialized algorithms that lack semantic reasoning capabilities. FaceGPT overcomes this limitation by embedding the parameters of a 3D morphable face model (3DMM) into the token space of a VLM, enabling the generation of 3D faces from both textual and visual inputs. FaceGPT is trained in a self-supervised manner as a model-based autoencoder from in-the-wild images. In particular, the hidden state of the LLM is projected into 3DMM parameters and subsequently rendered as a 2D face image to guide the self-supervised learning process via image-based reconstruction. Without relying on expensive 3D annotations of human faces, FaceGPT obtains a detailed understanding of 3D human faces while preserving the capacity to understand general user instructions. Our experiments demonstrate that FaceGPT not only achieves high-quality 3D face reconstructions but also retains the ability for general-purpose visual instruction following. Furthermore, FaceGPT learns, in a fully self-supervised manner, to generate 3D faces based on complex textual inputs, which opens a new direction in human face analysis.
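
The model-based autoencoder idea in the abstract (hidden state, then 3DMM parameters, then a rendered 2D output compared against the input) can be illustrated with a toy sketch. This is not the authors' implementation. All names, dimensions, and weights below are hypothetical, the 3DMM is a random linear model, and the differentiable image renderer and photometric loss are replaced by an orthographic projection of vertices and a point-wise mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
HIDDEN_DIM = 64   # size of the LLM hidden state
N_COEFFS = 10     # number of 3DMM shape coefficients
N_VERTS = 100     # number of mesh vertices

# Toy linear 3DMM: vertices = mean + basis @ coefficients.
mean_shape = rng.normal(size=N_VERTS * 3)
shape_basis = rng.normal(size=(N_VERTS * 3, N_COEFFS)) * 0.1

# Hypothetical linear head mapping a hidden state to 3DMM coefficients.
W = rng.normal(size=(N_COEFFS, HIDDEN_DIM)) * 0.01
b = np.zeros(N_COEFFS)


def project_to_3dmm(hidden_state):
    """Map an LLM hidden state to 3DMM coefficients (assumed linear head)."""
    return W @ hidden_state + b


def decode_vertices(coeffs):
    """Decode coefficients into an (N_VERTS, 3) array of mesh vertices."""
    return (mean_shape + shape_basis @ coeffs).reshape(N_VERTS, 3)


def orthographic_project(vertices):
    """Stand-in for rendering: drop the depth axis to get 2D points."""
    return vertices[:, :2]


def reconstruction_loss(pred_2d, target_2d):
    """Stand-in for an image-based loss: mean squared error on 2D points."""
    return float(np.mean((pred_2d - target_2d) ** 2))


# One forward pass of the toy autoencoder.
hidden_state = rng.normal(size=HIDDEN_DIM)
coeffs = project_to_3dmm(hidden_state)
verts = decode_vertices(coeffs)
points_2d = orthographic_project(verts)

# Toy target: the projection of the mean face (all coefficients zero).
target_2d = orthographic_project(decode_vertices(np.zeros(N_COEFFS)))
loss = reconstruction_loss(points_2d, target_2d)
```

In a real training setup the loss would be backpropagated through the renderer and the projection head into the VLM. The sketch only shows the forward data flow, which is why no 3D annotations appear anywhere in it.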
