74.5CVMay 29
TokTalk: Expressive Real-time Facial Animation from Audio-LLM TokensQingcheng Zhao, Yifang Pan, Karan Singh
Recent advances in Audio-LLMs like GPT-4o have ushered in an era of conversational interaction with language models. Conversational avatars however, still seem robotic in facial expression and conversational flow, in part due to sequential stages of speech recognition, text generation, turn-based text response, speech synthesis, and audio driven facial animation. Based on our insight that audio-tokens produced by current Audio-LLMs carry sufficient information to reconstruct a plausible facial performance, we present TokTalk, a system that directly outputs expressive facial animation in real-time from streaming audio-tokens. We construct a novel audio-token to 3D facial motion dataset, on which TokTalk is trained using a Chunk-based Conditional Flow Matching model. A lightweight adaptation strategy allows our trained model to seamlessly connect to any token-based Audio-LLM at minimal computational overhead. Our chunk-based processing further enables parametric trade-off between latency and facial quality, shown through ablation studies. We further show that the real-time performance of TokTalk is comparable in latency to prior art solutions, and significantly favorable (via a perceptual study) in terms of quality, expressivity and control of the 3D facial performance. We showcase TokTalk's flexibility using a chatbot Avatar, a voice-driven user Avatar, and an animation Director's interface, as diverse audio-visual face applications.
CVJan 28, 2024
Media2Face: Co-speech Facial Animation Generation With Multi-Modality GuidanceQingcheng Zhao, Pengyu Long, Qixuan Zhang et al.
The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and well-annotated abundant multi-modality labels, previous methods often suffer from limited realism and a lack of lexible conditioning. We address this challenge through a trilogy. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. Then, we utilize GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This presents the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial animation dataset with well-annotated emotional and style labels. Finally, we propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation, accepting rich multi-modality guidances from audio, text, and image. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.
CVJul 30, 2025
DepR: Depth Guided Single-view Scene Reconstruction with Instance-level DiffusionQingcheng Zhao, Xiang Zhang, Haiyang Xu et al.
We propose DepR, a depth-guided single-view scene reconstruction framework that integrates instance-level diffusion within a compositional paradigm. Instead of reconstructing the entire scene holistically, DepR generates individual objects and subsequently composes them into a coherent 3D layout. Unlike previous methods that use depth solely for object layout estimation during inference and therefore fail to fully exploit its rich geometric information, DepR leverages depth throughout both training and inference. Specifically, we introduce depth-guided conditioning to effectively encode shape priors into diffusion models. During inference, depth further guides DDIM sampling and layout optimization, enhancing alignment between the reconstruction and the input image. Despite being trained on limited synthetic data, DepR achieves state-of-the-art performance and demonstrates strong generalization in single-view scene reconstruction, as shown through evaluations on both synthetic and real-world datasets.
CVFeb 10, 2025
TANGLED: Generating 3D Hair Strands from Images with Arbitrary Styles and ViewpointsPengyu Long, Zijun Zhao, Min Ouyang et al.
Hairstyles are intricate and culturally significant with various geometries, textures, and structures. Existing text or image-guided generation methods fail to handle the richness and complexity of diverse styles. We present TANGLED, a novel approach for 3D hair strand generation that accommodates diverse image inputs across styles, viewpoints, and quantities of input views. TANGLED employs a three-step pipeline. First, our MultiHair Dataset provides 457 diverse hairstyles annotated with 74 attributes, emphasizing complex and culturally significant styles to improve model generalization. Second, we propose a diffusion framework conditioned on multi-view linearts that can capture topological cues (e.g., strand density and parting lines) while filtering out noise. By leveraging a latent diffusion model with cross-attention on lineart features, our method achieves flexible and robust 3D hair generation across diverse input conditions. Third, a parametric post-processing module enforces braid-specific constraints to maintain coherence in complex structures. This framework not only advances hairstyle realism and diversity but also enables culturally inclusive digital avatars and novel applications like sketch-based 3D strand editing for animation and augmented reality.