MM · Mar 26

Ges-QA: A Multidimensional Quality Assessment Dataset for Audio-to-3D Gesture Generation

arXiv:2508.12020 · 20.04 citations · h-index: 24
Predicted impact: top 17% in MM · last 90 days
Originality: Incremental advance
AI Analysis

This paper addresses the problem of evaluating gesture quality, which matters to researchers in virtual reality and computer graphics, though the contribution is incremental because it builds on existing datasets and methods.

The authors tackled the lack of human-aligned evaluation metrics for audio-to-3D gesture generation by introducing the Ges-QA dataset with 1,400 samples and multidimensional scores, and proposed a multi-modal transformer model that achieved state-of-the-art performance on this dataset.

The Audio-to-3D-Gesture (A2G) task has enormous potential for applications in areas such as virtual reality and computer graphics. However, current evaluation metrics, such as Fréchet Gesture Distance or Beat Constancy, fail to reflect human preference for the generated 3D gestures. To address this problem, exploring human preference and developing an objective quality assessment metric for AI-generated 3D human gestures is becoming increasingly important. In this paper, we introduce the Ges-QA dataset, which includes 1,400 samples with multidimensional scores for gesture quality and audio-gesture consistency. Moreover, we collect binary classification labels indicating whether the generated gestures match the emotions of the audio. Using the Ges-QA dataset, we propose a multi-modal transformer-based neural network with three branches for the video, audio, and 3D skeleton modalities, which can score A2G content in multiple dimensions. Comparative experiments and ablation studies demonstrate that Ges-QAer yields state-of-the-art performance on our dataset.
