CVApr 13, 2022
Dynamic Neural Textures: Generating Talking-Face Videos with Continuously Controllable ExpressionsZipeng Ye, Zhiyao Sun, Yu-Hui Wen et al. · tsinghua
Recently, talking-face video generation has received considerable attention. So far most methods generate results with neutral expressions or expressions that are implicitly determined by neural networks in an uncontrollable way. In this paper, we propose a method to generate talking-face videos with continuously controllable expressions in real-time. Our method is based on an important observation: In contrast to facial geometry of moderate resolution, most expression information lies in textures. Then we make use of neural textures to generate high-quality talking face videos and design a novel neural network that can generate neural textures for image frames (which we called dynamic neural textures) based on the input expression and continuous intensity expression coding (CIEC). Our method uses 3DMM as a 3D model to sample the dynamic neural texture. The 3DMM does not cover the teeth area, so we propose a teeth submodule to complete the details in teeth. Results and an ablation study show the effectiveness of our method in generating high-quality talking-face videos with continuously controllable expressions. We also set up four baseline methods by combining existing representative methods and compare them with our method. Experimental results including a user study show that our method has the best performance.
CVSep 17, 2022
Continuously Controllable Facial Expression Editing in Talking Face VideosZhiyao Sun, Yu-Hui Wen, Tian Lv et al.
Recently audio-driven talking face video generation has attracted considerable attention. However, very few researches address the issue of emotional editing of these talking face videos with continuously controllable expressions, which is a strong demand in the industry. The challenge is that speech-related expressions and emotion-related expressions are often highly coupled. Meanwhile, traditional image-to-image translation methods cannot work well in our application due to the coupling of expressions with other attributes such as poses, i.e., translating the expression of the character in each frame may simultaneously change the head pose due to the bias of the training data distribution. In this paper, we propose a high-quality facial expression editing method for talking face videos, allowing the user to control the target emotion in the edited video continuously. We present a new perspective for this task as a special case of motion information editing, where we use a 3DMM to capture major facial movements and an associated texture map modeled by a StyleGAN to capture appearance details. Both representations (3DMM and texture map) contain emotional information and can be continuously modified by neural networks and easily smoothed by averaging in coefficient/latent spaces, making our method simple yet effective. We also introduce a mouth shape preservation loss to control the trade-off between lip synchronization and the degree of exaggeration of the edited expression. Extensive experiments and a user study show that our method achieves state-of-the-art performance across various evaluation criteria.
CVMay 28, 2022Code
MDMLP: Image Classification from Scratch on Small Datasets with MLPTian Lv, Chongyang Bai, Chaojie Wang
The attention mechanism has become a go-to technique for natural language processing and computer vision tasks. Recently, the MLP-Mixer and other MLP-based architectures, based simply on multi-layer perceptrons (MLPs), are also powerful compared to CNNs and attention techniques and raises a new research direction. However, the high capability of the MLP-based networks severely relies on large volume of training data, and lacks of explanation ability compared to the Vision Transformer (ViT) or ConvNets. When trained on small datasets, they usually achieved inferior results than ConvNets. To resolve it, we present (i) multi-dimensional MLP (MDMLP), a conceptually simple and lightweight MLP-based architecture yet achieves SOTA when training from scratch on small-size datasets; (ii) multi-dimension MLP Attention Tool (MDAttnTool), a novel and efficient attention mechanism based on MLPs. Even without strong data augmentation, MDMLP achieves 90.90% accuracy on CIFAR10 with only 0.3M parameters, while the well-known MLP-Mixer achieves 85.45% with 17.1M parameters. In addition, the lightweight MDAttnTool highlights objects in images, indicating its explanation power. Our code is available at https://github.com/Amoza-Theodore/MDMLP.
CVSep 30, 2023
DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion ModelsZhiyao Sun, Tian Lv, Sheng Ye et al.
The generation of stylistic 3D facial animations driven by speech presents a significant challenge as it requires learning a many-to-many mapping between speech, style, and the corresponding natural facial motion. However, existing methods either employ a deterministic model for speech-to-motion mapping or encode the style using a one-hot encoding scheme. Notably, the one-hot encoding approach fails to capture the complexity of the style and thus limits generalization ability. In this paper, we propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder that extracts style embeddings from short reference videos. During inference, we employ classifier-free guidance to guide the generation process based on the speech and style. In particular, our style includes the generation of head poses, thereby enhancing user perception. Additionally, we address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset. Extensive experiments and user study demonstrate that our approach outperforms state-of-the-art methods. The code and dataset are at https://diffposetalk.github.io .
CVMar 3
SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene UnderstandingSheng Ye, Zhen-Hui Dong, Ruoyu Fan et al.
Semantic understanding of 3D scenes is essential for robots to operate effectively and safely in complex environments. Existing methods for semantic scene reconstruction and semantic-aware novel view synthesis often rely on dense multi-view inputs and require scene-specific optimization, limiting their practicality and scalability in real-world applications. To address these challenges, we propose SemGS, a feed-forward framework for reconstructing generalizable semantic fields from sparse image inputs. SemGS uses a dual-branch architecture to extract color and semantic features, where the two branches share shallow CNN layers, allowing semantic reasoning to leverage textural and structural cues in color appearance. We also incorporate a camera-aware attention mechanism into the feature extractor to explicitly model geometric relationships between camera viewpoints. The extracted features are decoded into dual-Gaussians that share geometric consistency while preserving branch-specific attributes, and further rasterized to synthesize semantic maps under novel viewpoints. Additionally, we introduce a regional smoothness loss to enhance semantic coherence. Experiments show that SemGS achieves state-of-the-art performance on benchmark datasets, while providing rapid inference and strong generalization capabilities across diverse synthetic and real-world scenarios.
CLJan 18, 2021Code
Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-Level Backdoor AttacksZhengyan Zhang, Guangxuan Xiao, Yongwei Li et al.
Pre-trained models (PTMs) have been widely used in various downstream tasks. The parameters of PTMs are distributed on the Internet and may suffer backdoor attacks. In this work, we demonstrate the universal vulnerability of PTMs, where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary downstream tasks. Specifically, attackers can add a simple pre-training task, which restricts the output representations of trigger instances to pre-defined vectors, namely neuron-level backdoor attack (NeuBA). If the backdoor functionality is not eliminated during fine-tuning, the triggers can make the fine-tuned model predict fixed labels by pre-defined vectors. In the experiments of both natural language processing (NLP) and computer vision (CV), we show that NeuBA absolutely controls the predictions for trigger instances without any knowledge of downstream tasks. Finally, we apply several defense methods to NeuBA and find that model pruning is a promising direction to resist NeuBA by excluding backdoored neurons. Our findings sound a red alarm for the wide use of PTMs. Our source code and models are available at \url{https://github.com/thunlp/NeuBA}.
CVMar 15, 2025
Tailor: An Integrated Text-Driven CG-Ready Human and Garment Generation SystemZhiyao Sun, Yu-Hui Wen, Matthieu Lin et al.
Creating detailed 3D human avatars with garments typically requires specialized expertise and labor-intensive processes. Although recent advances in generative AI have enabled text-to-3D human/clothing generation, current methods fall short in offering accessible, integrated pipelines for producing ready-to-use clothed avatars. To solve this, we introduce Tailor, an integrated text-to-avatar system that generates high-fidelity, customizable 3D humans with simulation-ready garments. Our system includes a three-stage pipeline. We first employ a large language model to interpret textual descriptions into parameterized body shapes and semantically matched garment templates. Next, we develop topology-preserving deformation with novel geometric losses to adapt garments precisely to body geometries. Furthermore, an enhanced texture diffusion module with a symmetric local attention mechanism ensures both view consistency and photorealistic details. Quantitative and qualitative evaluations demonstrate that Tailor outperforms existing SoTA methods in terms of fidelity, usability, and diversity. Code will be available for academic use.