Zeyu Ling

CV
h-index9
5papers
10citations
Novelty51%
AI Score38

5 Papers

CVSep 6, 2023
MCM: Multi-condition Motion Synthesis Framework for Multi-scenario

Zeyu Ling, Bo Han, Yongkang Wong et al.

The objective of the multi-condition human motion synthesis task is to incorporate diverse conditional inputs, encompassing various forms like text, music, speech, and more. This endows the task with the capability to adapt across multiple scenarios, ranging from text-to-motion and music-to-dance, among others. While existing research has primarily focused on single conditions, the multi-condition human motion generation remains underexplored. In this paper, we address these challenges by introducing MCM, a novel paradigm for motion synthesis that spans multiple scenarios under diverse conditions. The MCM framework is able to integrate with any DDPM-like diffusion model to accommodate multi-conditional information input while preserving its generative capabilities. Specifically, MCM employs two-branch architecture consisting of a main branch and a control branch. The control branch shares the same structure as the main branch and is initialized with the parameters of the main branch, effectively maintaining the generation ability of the main branch and supporting multi-condition input. We also introduce a Transformer-based diffusion model MWNet (DDPM-like) as our main branch that can capture the spatial complexity and inter-joint correlations in motion sequences through a channel-dimension self-attention module. Quantitative comparisons demonstrate that our approach achieves SoTA results in both text-to-motion and competitive results in music-to-dance tasks, comparable to task-specific methods. Furthermore, the qualitative evaluation shows that MCM not only streamlines the adaptation of methodologies originally designed for text-to-motion tasks to domains like music-to-dance and speech-to-gesture, eliminating the need for extensive network re-configurations but also enables effective multi-condition modal control, realizing "once trained is motion need".

AIOct 11, 2025
SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation

Zeyu Ling, Xiaodong Gu, Jiangnan Tang et al.

We introduce SyncLipMAE, a self-supervised pretraining framework for talking-face video that learns synchronization-aware and transferable facial dynamics from unlabeled audio-visual streams. Our approach couples masked visual modeling with cross-modal contrastive alignment and employs three per-frame prompt tokens that explicitly encode the essential factors of a talking-face frame - identity, vocal motion (speech-synchronized facial dynamics), and ambient motion (audio-agnostic movements such as blinks and head pose). The contrastive objective uses time-aligned vocal-motion and audio tokens as positives and misaligned pairs as negatives, driving both modalities into a shared embedding space and yielding token-level audio-visual stream synchronization. After pretraining, the aligned audio tokens together with the visual prompt tokens (identity, vocal motion, ambient motion) form a unified interface for four disparate downstream settings: (i) audio-visual stream synchronization; (ii) facial emotion and head/face action recognition; (iii) visual speech recognition; and (iv) visual dubbing, for which we enable indistinguishable audio- or video-driven control within a single model. Across four task families that require distinct capabilities, SyncLipMAE achieves state-of-the-art results, underscoring the effectiveness of synchronization-aware, factorized self-supervised pretraining.

CVJul 10, 2025
EPIC: Efficient Prompt Interaction for Text-Image Classification

Xinyao Yu, Hao Sun, Zeyu Ling et al.

In recent years, large-scale pre-trained multimodal models (LMMs) generally emerge to integrate the vision and language modalities, achieving considerable success in multimodal tasks, such as text-image classification. The growing size of LMMs, however, results in a significant computational cost for fine-tuning these models for downstream tasks. Hence, prompt-based interaction strategy is studied to align modalities more efficiently. In this context, we propose a novel efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC). Specifically, we utilize temporal prompts on intermediate layers, and integrate different modalities with similarity-based prompt interaction, to leverage sufficient information exchange between modalities. Utilizing this approach, our method achieves reduced computational resource consumption and fewer trainable parameters (about 1\% of the foundation model) compared to other fine-tuning strategies. Furthermore, it demonstrates superior performance on the UPMC-Food101 and SNLI-VE datasets, while achieving comparable performance on the MM-IMDB dataset.

CVNov 26, 2024
VersatileMotion: A Unified Framework for Motion Synthesis and Comprehension

Zeyu Ling, Bo Han, Shiyang Li et al.

Large language models (LLMs) are, by design, inherently capable of multi-task learning: through a unified next-token prediction paradigm, they can naturally address a wide variety of downstream tasks. Prior work in the motion domain has demonstrated some generality by adapting LLMs via a Motion Tokenizer coupled with an autoregressive Transformer to generate and understand human motion. However, this generality remains limited in scope and yields only modest performance gains. We introduce VersatileMotion, a unified multimodal motion LLM that combines a novel motion tokenizer, integrating VQ-VAE with flow matching, and an autoregressive transformer backbone to seamlessly support at least nine distinct motion-related tasks. VersatileMotion is the first method to handle single-agent and multi-agent motions in a single framework and enable cross-modal conversion between motion, text, music, and speech, achieving state-of-the-art performance on seven of these tasks. Each sequence in MotionHub may include one or more of the following annotations: natural-language captions, music or audio clips, speech transcripts, and multi-agent interaction data. To facilitate evaluation, we define and release benchmark splits covering nine core tasks. Extensive experiments demonstrate the superior performance, versatility, and potential of VersatileMotion as a foundational model for future understanding and generation of motion.

CVApr 19, 2024
MCM: Multi-condition Motion Synthesis Framework

Zeyu Ling, Bo Han, Yongkang Wongkan et al.

Conditional human motion synthesis (HMS) aims to generate human motion sequences that conform to specific conditions. Text and audio represent the two predominant modalities employed as HMS control conditions. While existing research has primarily focused on single conditions, the multi-condition human motion synthesis remains underexplored. In this study, we propose a multi-condition HMS framework, termed MCM, based on a dual-branch structure composed of a main branch and a control branch. This framework effectively extends the applicability of the diffusion model, which is initially predicated solely on textual conditions, to auditory conditions. This extension encompasses both music-to-dance and co-speech HMS while preserving the intrinsic quality of motion and the capabilities for semantic association inherent in the original model. Furthermore, we propose the implementation of a Transformer-based diffusion model, designated as MWNet, as the main branch. This model adeptly apprehends the spatial intricacies and inter-joint correlations inherent in motion sequences, facilitated by the integration of multi-wise self-attention modules. Extensive experiments show that our method achieves competitive results in single-condition and multi-condition HMS tasks.