GR | AI | CV | SD | AS | May 13, 2025

M3G: Multi-Granular Gesture Generator for Audio-Driven Full-Body Human Motion Synthesis

arXiv:2505.08293v2 | 1 citation | h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of creating natural and expressive virtual avatars for applications such as animation or VR. It is an incremental improvement that models the variable temporal granularity of gestures.

The paper tackles the problem of generating full-body human gestures from audio for virtual avatar creation, proposing the M3G framework with a Multi-Granular VQ-VAE and token predictor, and reports that it outperforms state-of-the-art methods in objective and subjective evaluations.

Generating full-body human gestures encompassing face, body, hands, and global movements from audio is a valuable yet challenging task in virtual avatar creation. Previous systems focused on tokenizing human gestures frame by frame and predicting the tokens of each frame from the input audio. However, the number of frames required for a complete expressive human gesture, defined here as granularity, varies among different gesture patterns. Existing systems fail to model these gesture patterns because their gesture tokens have a fixed granularity. To solve this problem, we propose a novel framework named Multi-Granular Gesture Generator (M3G) for audio-driven holistic gesture generation. In M3G, we propose a novel Multi-Granular VQ-VAE (MGVQ-VAE) to tokenize motion patterns and reconstruct motion sequences from different temporal granularities. We then propose a multi-granular token predictor that extracts multi-granular information from audio and predicts the corresponding motion tokens. Finally, M3G reconstructs the human gestures from the predicted tokens using the MGVQ-VAE. Both objective and subjective experiments demonstrate that our proposed M3G framework outperforms state-of-the-art methods in generating natural and expressive full-body human gestures.
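To make the idea of tokenizing motion at several temporal granularities concrete, here is a minimal sketch in plain Python. It is not the paper's MGVQ-VAE: the learned encoder and decoder are replaced by simple window averaging and code repetition, and the codebooks are given rather than learned. The function names (`pool_frames`, `nearest_code`, `tokenize`, `reconstruct`), the toy codebooks, and the choice to average reconstructions across granularities are all assumptions made for illustration.

```python
def pool_frames(frames, g):
    """Average every g consecutive frames into one vector.

    Stand-in for a learned encoder (assumption); the last window may be shorter.
    """
    pooled = []
    for start in range(0, len(frames), g):
        window = frames[start:start + g]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def tokenize(frames, codebooks):
    """Map each granularity g to a token list, one token per window of g frames."""
    return {
        g: [nearest_code(v, cb) for v in pool_frames(frames, g)]
        for g, cb in codebooks.items()
    }


def reconstruct(tokens, codebooks, num_frames):
    """Repeat each code over the frames it covers, then average over granularities.

    Stand-in for a learned decoder (assumption).
    """
    dim = len(next(iter(codebooks.values()))[0])
    out = [[0.0] * dim for _ in range(num_frames)]
    for g, ids in tokens.items():
        for t in range(num_frames):
            code = codebooks[g][ids[t // g]]
            for d in range(dim):
                out[t][d] += code[d] / len(tokens)
    return out


# Toy example: 8 frames of 2-D "motion", one codebook per granularity.
frames = [[float(t), 0.0] for t in range(8)]
codebooks = {
    1: [[0.0, 0.0], [4.0, 0.0]],   # per-frame tokens
    4: [[1.5, 0.0], [5.5, 0.0]],   # one token per 4 frames
}
tokens = tokenize(frames, codebooks)
recon = reconstruct(tokens, codebooks, len(frames))
print(tokens)
print(recon)
```

A coarse granularity yields fewer tokens per sequence (two tokens for eight frames at g = 4), which is the property a fixed per-frame tokenizer lacks. In the actual framework the tokens would be predicted from audio rather than computed from ground-truth motion.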

