CVOct 14, 2024Code
MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal SamplingYue Zhang, Zhizhou Zhong, Minhao Liu et al.
Real-time video dubbing that preserves identity consistency while achieving accurate lip synchronization remains a critical challenge. Existing approaches face a trilemma: diffusion-based methods achieve high visual fidelity but suffer from prohibitive computational costs, while GAN-based solutions sacrifice lip-sync accuracy or dental details for real-time performance. We present MuseTalk, a novel two-stage training framework that resolves this trade-off through latent space optimization and spatio-temporal data sampling strategy. Our key innovations include: (1) During the Facial Abstract Pretraining stage, we propose Informative Frame Sampling to temporally align reference-source pose pairs, eliminating redundant feature interference while preserving identity cues. (2) In the Lip-Sync Adversarial Finetuning stage, we employ Dynamic Margin Sampling to spatially select the most suitable lip-movement-promoting regions, balancing audio-visual synchronization and dental clarity. (3) MuseTalk establishes an effective audio-visual feature fusion framework in the latent space, delivering 30 FPS output at 256*256 resolution on an NVIDIA V100 GPU. Extensive experiments demonstrate that MuseTalk outperforms state-of-the-art methods in visual fidelity while achieving comparable lip-sync accuracy. %The codes and models will be made publicly available upon acceptance. The code is made available at \href{https://github.com/TMElyralab/MuseTalk}{https://github.com/TMElyralab/MuseTalk}
CVNov 21, 2019Code
MIMAMO Net: Integrating Micro- and Macro-motion for Video Emotion RecognitionDidan Deng, Zhaokang Chen, Yuqian Zhou et al.
Spatial-temporal feature learning is of vital importance for video emotion recognition. Previous deep network structures often focused on macro-motion which extends over long time scales, e.g., on the order of seconds. We believe integrating structures capturing information about both micro- and macro-motion will benefit emotion prediction, because human perceive both micro- and macro-expressions. In this paper, we propose to combine micro- and macro-motion features to improve video emotion recognition with a two-stream recurrent network, named MIMAMO (Micro-Macro-Motion) Net. Specifically, smaller and shorter micro-motions are analyzed by a two-stream network, while larger and more sustained macro-motions can be well captured by a subsequent recurrent network. Assigning specific interpretations to the roles of different parts of the network enables us to make choice of parameters based on prior knowledge: choices that turn out to be optimal. One of the important innovations in our model is the use of interframe phase differences rather than optical flow as input to the temporal stream. Compared with the optical flow, phase differences require less computation and are more robust to illumination changes. Our proposed network achieves state of the art performance on two video emotion datasets, the OMG emotion dataset and the Aff-Wild dataset. The most significant gains are for arousal prediction, for which motion information is intuitively more informative. Source code is available at https://github.com/wtomin/MIMAMO-Net.
CVFeb 10, 2020
Multitask Emotion Recognition with Incomplete LabelsDidan Deng, Zhaokang Chen, Bertram E. Shi
We train a unified model to perform three tasks: facial action unit detection, expression classification, and valence-arousal estimation. We address two main challenges of learning the three tasks. First, most existing datasets are highly imbalanced. Second, most existing datasets do not contain labels for all three tasks. To tackle the first challenge, we apply data balancing techniques to experimental datasets. To tackle the second challenge, we propose an algorithm for the multitask model to learn from missing (incomplete) labels. This algorithm has two steps. We first train a teacher model to perform all three tasks, where each instance is trained by the ground truth label of its corresponding task. Secondly, we refer to the outputs of the teacher model as the soft labels. We use the soft labels and the ground truth to train the student model. We find that most of the student models outperform their teacher model on all the three tasks. Finally, we use model ensembling to boost performance further on the three tasks.
CVJan 25, 2020
Towards High Performance Low Complexity Calibration in Appearance Based Gaze EstimationZhaokang Chen, Bertram E. Shi
Appearance-based gaze estimation from RGB images provides relatively unconstrained gaze tracking. We have previously proposed a gaze decomposition method that decomposes the gaze angle into the sum of a subject-independent gaze estimate from the image and a subject-dependent bias. This paper extends that work with a more complete characterization of the interplay between the complexity of the calibration dataset and estimation accuracy. We analyze the effect of the number of gaze targets, the number of images used per gaze target and the number of head positions in calibration data using a new NISLGaze dataset, which is well suited for analyzing these effects as it includes more diversity in head positions and orientations for each subject than other datasets. A better understanding of these factors enables low complexity high performance calibration. Our results indicate that using only a single gaze target and single head position is sufficient to achieve high quality calibration, outperforming state-of-the-art methods by more than 6.3%. One of the surprising findings is that the same estimator yields the best performance both with and without calibration. To better understand the reasons, we provide a new theoretical analysis that specifies the conditions under which this can be expected.
CVMay 11, 2019
Offset Calibration for Appearance-Based Gaze Estimation via Gaze DecompositionZhaokang Chen, Bertram E. Shi
Appearance-based gaze estimation provides relatively unconstrained gaze tracking. However, subject-independent models achieve limited accuracy partly due to individual variations. To improve estimation, we propose a novel gaze decomposition method and a single gaze point calibration method, motivated by our finding that the inter-subject squared bias exceeds the intra-subject variance for a subject-independent estimator. We decompose the gaze angle into a subject-dependent bias term and a subject-independent term between the gaze angle and the bias. The subject-independent term is estimated by a deep convolutional network. For calibration-free tracking, we set the subject-dependent bias term to zero. For single gaze point calibration, we estimate the bias from a few images taken as the subject gazes at a point. Experiments on three datasets indicate that as a calibration-free estimator, the proposed method outperforms the state-of-the-art methods by up to $10.0\%$. The proposed calibration method is robust and reduces estimation error significantly (up to $35.6\%$), achieving state-of-the-art performance for appearance-based eye trackers with calibration.
CVMar 18, 2019
Appearance-Based Gaze Estimation Using Dilated-ConvolutionsZhaokang Chen, Bertram E. Shi
Appearance-based gaze estimation has attracted more and more attention because of its wide range of applications. The use of deep convolutional neural networks has improved the accuracy significantly. In order to improve the estimation accuracy further, we focus on extracting better features from eye images. Relatively large changes in gaze angles may result in relatively small changes in eye appearance. We argue that current architectures for gaze estimation may not be able to capture such small changes, as they apply multiple pooling layers or other downsampling layers so that the spatial resolution of the high-level layers is reduced significantly. To evaluate whether the use of features extracted at high resolution can benefit gaze estimation, we adopt dilated-convolutions to extract high-level features without reducing spatial resolution. In cross-subject experiments on the Columbia Gaze dataset for eye contact detection and the MPIIGaze dataset for 3D gaze vector regression, the resulting Dilated-Nets achieve significant (up to 20.8%) gains when compared to similar networks without dilated-convolutions. Our proposed Dilated-Net achieves state-of-the-art results on both the Columbia Gaze and the MPIIGaze datasets.
HCApr 21, 2017
Using Variable Dwell Time to Accelerate Gaze-Based Web Browsing with Two-Step SelectionZhaokang Chen, Bertram E. Shi
In order to avoid the "Midas Touch" problem, gaze-based interfaces for selection often introduce a dwell time: a fixed amount of time the user must fixate upon an object before it is selected. Past interfaces have used a uniform dwell time across all objects. Here, we propose a gaze-based browser using a two-step selection policy with variable dwell time. In the first step, a command, e.g. "back" or "select", is chosen from a menu using a dwell time that is constant across the different commands. In the second step, if the "select" command is chosen, the user selects a hyperlink using a dwell time that varies between different hyperlinks. We assign shorter dwell times to more likely hyperlinks and longer dwell times to less likely hyperlinks. In order to infer the likelihood each hyperlink will be selected, we have developed a probabilistic model of natural gaze behavior while surfing the web. We have evaluated a number of heuristic and probabilistic methods for varying the dwell times using both simulation and experiment. Our results demonstrate that varying dwell time improves the user experience in comparison with fixed dwell time, resulting in fewer errors and increased speed. While all of the methods for varying dwell time resulted in improved performance, the probabilistic models yielded much greater gains than the simple heuristics. The best performing model reduces error rate by 50% compared to 100ms uniform dwell time while maintaining a similar response time. It reduces response time by 60% compared to 300ms uniform dwell time while maintaining a similar error rate.