CV · Nov 2, 2023 · Code
POS: A Prompts Optimization Suite for Augmenting Text-to-Video Generation
Shijie Ma, Huayi Xu, Mengjian Li et al.
This paper aims to enhance diffusion-based text-to-video generation by improving its two input prompts: the noise and the text. To this end, we propose POS, a training-free Prompt Optimization Suite to boost text-to-video models. POS is motivated by two observations. (1) Video generation is unstable with respect to noise. Given the same text, different noises lead to videos that differ significantly in both frame quality and temporal consistency. This observation implies that there exists an optimal noise matched to each textual input. To approach this potential optimal noise, we propose an optimal noise approximator, which first searches for a video closely related to the text prompt and then inverts it into the noise space to serve as an improved noise prompt for the textual input. (2) Improving the text prompt via LLMs often causes semantic deviation. Many existing text-to-vision works have used LLMs to improve text prompts for better generation. However, existing methods often neglect the semantic alignment between the original text and the rewritten one. In response, we design a semantic-preserving rewriter that imposes constraints in both the rewriting and denoising phases to preserve semantic consistency. Extensive experiments on popular benchmarks show that POS improves text-to-video models by a clear margin. The code will be open-sourced.
SD · Mar 16, 2024
Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference
Fan Zhang, Zhaohan Wang, Xin Lyu et al.
Speech-driven gesture generation is an emerging field within virtual human creation. However, a significant challenge lies in accurately determining and processing the multitude of input features (such as acoustic, semantic, emotional, personality, and even subtle unknown features). Traditional approaches, which rely on various explicit feature inputs and complex multimodal processing, constrain the expressiveness of the resulting gestures and limit their applicability. To address these challenges, we present Persona-Gestor, a novel end-to-end generative model designed to generate highly personalized 3D full-body gestures relying solely on raw speech audio. The model combines a fuzzy feature extractor and a non-autoregressive Adaptive Layer Normalization (AdaLN) transformer diffusion architecture. The fuzzy feature extractor uses a fuzzy inference strategy that automatically infers implicit, continuous fuzzy features. These fuzzy features, represented as a unified latent feature, are fed into the AdaLN transformer. The AdaLN transformer introduces a conditional mechanism that applies a uniform function across all tokens, thereby effectively modeling the correlation between the fuzzy features and the gesture sequence. This module ensures a high level of gesture-speech synchronization while preserving naturalness. Finally, we employ the diffusion model to train and infer various gestures. Extensive subjective and objective evaluations on the Trinity, ZEGGS, and BEAT datasets confirm our model's superior performance over current state-of-the-art approaches. Persona-Gestor improves the system's usability and generalization capabilities, setting a new benchmark in speech-driven gesture synthesis and broadening the horizon for virtual human technology. Supplementary videos and code can be accessed at https://zf223669.github.io/Diffmotion-v2-website/
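The AdaLN conditioning described above, where one function of the condition is applied uniformly to every token, can be illustrated with a minimal sketch. This is not the Persona-Gestor implementation: the helper names (`layer_norm`, `linear`, `adaln`), the plain linear projections, and the `1 + scale` modulation convention are all assumptions chosen for illustration, and the `condition` vector merely stands in for the unified fuzzy latent feature.

```python
import math

def layer_norm(token, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]

def linear(weights, bias, vec):
    """Plain matrix-vector product plus bias (hypothetical projection layer)."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def adaln(tokens, condition, w_scale, b_scale, w_shift, b_shift):
    """Adaptive LayerNorm sketch: scale and shift are regressed from the
    condition once, then the same pair modulates every token."""
    scale = linear(w_scale, b_scale, condition)  # assumed projection
    shift = linear(w_shift, b_shift, condition)  # assumed projection
    out = []
    for token in tokens:
        normed = layer_norm(token)
        # Assumed convention: output = normed * (1 + scale) + shift
        out.append([n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)])
    return out
```

Because `scale` and `shift` depend only on the condition, every token in the sequence receives the same modulation, which is the "uniform function across all tokens" property the abstract refers to.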
GR · Mar 5, 2024
Towards Geometric-Photometric Joint Alignment for Facial Mesh Registration
Xizhi Wang, Yaxiong Wang, Mengjian Li
This paper presents a Geometric-Photometric Joint Alignment (GPJA) method, which aligns discrete human expressions at pixel-level accuracy by combining geometric and photometric information. Common practices for registering human heads typically involve aligning landmarks with facial template meshes using geometry processing approaches, but often overlook dense pixel-level photometric consistency. This oversight leads to inconsistent texture parametrization across different expressions, hindering the creation of topologically consistent head meshes widely used in movies and games. GPJA overcomes this limitation by leveraging differentiable rendering to align vertices with target expressions, achieving joint alignment in both geometry and photometric appearance automatically, without requiring semantic annotation or pre-aligned meshes for training. It features a holistic rendering alignment mechanism and a multiscale regularized optimization for robust convergence under large deformations. The method uses derivatives at vertex positions for supervision and employs a gradient-based algorithm that guarantees smoothness and avoids topological artifacts during the geometry evolution. Experimental results demonstrate faithful alignment under various expressions, surpassing conventional non-rigid ICP-based methods and the state-of-the-art deep-learning-based method. In practice, our method generates meshes of the same subject across diverse expressions, all with the same texture parametrization. This consistency benefits face animation, re-parametrization, and other batch operations for face modeling and applications, with enhanced efficiency.
AI · Nov 23, 2020
APAN: Asynchronous Propagation Attention Network for Real-time Temporal Graph Embedding
Xuhong Wang, Ding Lyu, Mengjian Li et al.
Limited by the time complexity of querying k-hop neighbors in a graph database, most graph algorithms cannot be deployed online or execute inference at millisecond-level latency. This problem dramatically limits the potential of applying graph algorithms in certain areas, such as financial fraud detection. Therefore, we propose the Asynchronous Propagation Attention Network (APAN), an asynchronous continuous-time dynamic graph algorithm for real-time temporal graph embedding. Traditional graph models usually execute two serial operations: first graph computation and then model inference. We decouple the model inference and graph computation steps so that heavy graph query operations do not slow down model inference. Extensive experiments demonstrate that the proposed method achieves competitive performance while also delivering an 8.7 times improvement in inference speed.
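The decoupling of model inference from graph computation can be sketched with a toy example. The abstract does not specify the mechanism, so everything here is an assumption for illustration: the `MailboxStore` class, the per-node mailbox, the averaging "model", and the one-hop delivery are hypothetical stand-ins, not APAN's actual components.

```python
from collections import defaultdict, deque

class MailboxStore:
    """Toy decoupling of inference from graph propagation (illustrative only)."""

    def __init__(self, neighbors, mailbox_size=3):
        self.neighbors = neighbors  # dict: node -> list of neighbor nodes
        # Each node keeps a bounded mailbox of recently delivered features.
        self.mailboxes = defaultdict(lambda: deque(maxlen=mailbox_size))
        self.pending = deque()      # events waiting for propagation

    def infer(self, node, event_feature):
        """Latency-critical path: reads only the node's own mailbox,
        so no neighbor query is issued at inference time."""
        mails = self.mailboxes[node]
        history = sum(mails) / len(mails) if mails else 0.0
        score = 0.5 * event_feature + 0.5 * history  # placeholder for a real model
        self.pending.append((node, event_feature))   # defer the graph work
        return score

    def propagate(self):
        """Off the critical path: deliver queued events to neighbors' mailboxes.
        Returns the number of mails delivered."""
        delivered = 0
        while self.pending:
            node, feature = self.pending.popleft()
            for nb in self.neighbors.get(node, []):
                self.mailboxes[nb].append(feature)
                delivered += 1
        return delivered
```

In this sketch `infer` never touches `neighbors`; the expensive neighbor traversal happens only in `propagate`, which could run in a background worker after the response has been returned.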