CVNov 3, 2025Code
Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art BenchmarkRajmund Nagy, Hendric Voss, Thanh Hoang-Minh et al.
We review human evaluation practices in automated, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. Finally, in order to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies -- enabling new evaluations without model reimplementation required -- alongside our open-source rendering script, and the 16,000 pairwise human preference votes collected for our benchmark.
CVJan 10, 2023
Speech Driven Video Editing via an Audio-Conditioned Diffusion ModelDan Bigioi, Shubhajit Basak, Michał Stypułkowski et al.
Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a talking person, and a separate auditory speech recording, the lip and jaw motions are re-synchronized without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio mel spectral features to generate synchronised facial motion. Proof of concept results are demonstrated on both single-speaker and multi-speaker video editing, providing a baseline model on the CREMA-D audiovisual data set. To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying end-to-end denoising diffusion models to the task of audio-driven video editing.
HCApr 30
The Impact of Navigation on Proxemics in an Immersive Virtual Environment with Conversational AgentsRose Connolly, Lauren Buck, Victor Zordan et al.
As social VR grows in popularity, understanding how to optimise interactions becomes increasingly important. Interpersonal distance (the physical space people maintain between each other) is a key aspect of user experience. Previous work in psychology has shown that breaches of personal space cause stress and discomfort. Thus, effectively managing this distance is crucial in social VR, where social interactions are frequent. Teleportation, a commonly used locomotion method in these environments, involves distinct cognitive processes and requires users to rely on their ability to estimate distance. Despite its widespread use, the effect of teleportation on proximity remains unexplored. To investigate this, we measured the interpersonal distance of 70 participants during interactions with embodied conversational agents, comparing teleportation to natural walking. Our findings revealed that participants maintained closer proximity from the agents during teleportation. Female participants kept greater distances from the agents than male participants, and natural walking was associated with higher agency and body ownership, though co-presence remained unchanged. We propose that differences in spatial perception and spatial cognitive load contribute to reduced interpersonal distance with teleportation. These findings emphasise that proximity should be a key consideration when selecting locomotion methods in social VR, highlighting the need for further research on how locomotion impacts spatial perception and social dynamics in virtual environments.
CVAug 29, 2023
A lightweight 3D dense facial landmark estimation model from position map dataShubhajit Basak, Sathish Mangapuram, Gabriel Costache et al.
The incorporation of 3D data in facial analysis tasks has gained popularity in recent years. Though it provides a more accurate and detailed representation of the human face, accruing 3D face data is more complex and expensive than 2D face images. Either one has to rely on expensive 3D scanners or depth sensors which are prone to noise. An alternative option is the reconstruction of 3D faces from uncalibrated 2D images in an unsupervised way without any ground truth 3D data. However, such approaches are computationally expensive and the learned model size is not suitable for mobile or other edge device applications. Predicting dense 3D landmarks over the whole face can overcome this issue. As there is no public dataset available containing dense landmarks, we propose a pipeline to create a dense keypoint training dataset containing 520 key points across the whole face from an existing facial position map data. We train a lightweight MobileNet-based regressor model with the generated data. As we do not have access to any evaluation dataset with dense landmarks in it we evaluate our model against the 68 keypoint detection task. Experimental results show that our trained model outperforms many of the existing methods in spite of its lower model size and minimal computational cost. Also, the qualitative evaluation shows the efficiency of our trained models in extreme head pose angles as well as other facial variations and occlusions.
GRMay 7
Reality Check: How Avatar and Face Representation Affect the Perceptual Evaluation of Synthesized GesturesHaoyang Du, Yinghan Xu, John Dingliana et al.
The capacity to create realistic virtual humans has progressed significantly, and such characters can be found in many applications across entertainment, education and health. As an essential element of interactive virtual humans, speech-driven 3D gesture generation still depends heavily on perceptual evaluation, yet studies often vary avatar appearance and facial presentation when judging the generated motions. Prior work suggests these visual choices can bias motion judgments, but controlled evidence remains limited. We address this gap with controlled evaluations of co-speech gestures across motion sources, spanning seven representative avatar renderings used in contemporary research and application pipelines. Our results show that avatar and face presentation systematically shift perceptual judgments, and we provide recommendations for benchmarking gesture synthesis as well as for deploying virtual humans in human-facing applications.
HCOct 7, 2021
AppearanceRachel McDonnell, Bilge Mutlu
Socially interactive agents (SIAs) are no longer mere visions for future user interfaces, as 20 years of research and technology development has enabled the use of virtual and physical agents in day-to-day interfaces and environments. This chapter of the ACM "The Handbook on Socially Interactive Agents" reviews research on and technologies involving socially interactive agents, including virtually embodied agents and physically embodied robots, focusing particularly on the appearance of socially interactive agents. It covers the history of the development of these technologies; outlines the design space for the appearance of agents, including what appearance comprises, modalities in which agents are presented, and how agents are constructed; and the features that agents use to support social interaction, including facial and bodily features, those that express demographic characteristics, and issues surrounding realism, appeal, and the uncanny valley. The chapter concludes with a brief discussion of open questions surrounding the appearance of socially interactive agents.
HCMar 4, 2021
It's A Match! Gesture Generation Using Expressive Parameter MatchingYlva Ferstl, Michael Neff, Rachel McDonnell
Automatic gesture generation from speech generally relies on implicit modelling of the nondeterministic speech-gesture relationship and can result in averaged motion lacking defined form. Here, we propose a database-driven approach of selecting gestures based on specific motion characteristics that have been shown to be associated with the speech audio. We extend previous work that identified expressive parameters of gesture motion that can both be predicted from speech and are perceptually important for a good speech-gesture match, such as gesture velocity and finger extension. A perceptual study was performed to evaluate the appropriateness of the gestures selected with our method. We compare our method with two baseline selection methods. The first respects timing, the desired onset and duration of a gesture, but does not match gesture form in other ways. The second baseline additionally disregards the original gesture timing for selecting gestures. The gesture sequences from our method were rated as a significantly better match to the speech than gestures selected by either baseline method.
HCOct 2, 2020
Understanding the Predictability of Gesture Parameters from Speech and their Perceptual ImportanceYlva Ferstl, Michael Neff, Rachel McDonnell
Gesture behavior is a natural part of human conversation. Much work has focused on removing the need for tedious hand-animation to create embodied conversational agents by designing speech-driven gesture generators. However, these generators often work in a black-box manner, assuming a general relationship between input speech and output motion. As their success remains limited, we investigate in more detail how speech may relate to different aspects of gesture motion. We determine a number of parameters characterizing gesture, such as speed and gesture size, and explore their relationship to the speech signal in a two-fold manner. First, we train multiple recurrent networks to predict the gesture parameters from speech to understand how well gesture attributes can be modeled from speech alone. We find that gesture parameters can be partially predicted from speech, and some parameters, such as path length, being predicted more accurately than others, like velocity. Second, we design a perceptual study to assess the importance of each gesture parameter for producing motion that people perceive as appropriate for the speech. Results show that a degradation in any parameter was viewed negatively, but some changes, such as hand shape, are more impactful than others. A video summarization can be found at https://youtu.be/aw6-_5kmLjY.
CVJun 21, 2020
Methodology for Building Synthetic Datasets with Virtual HumansShubhajit Basak, Hossein Javidnia, Faisal Khan et al.
Recent advances in deep learning methods have increased the performance of face detection and recognition systems. The accuracy of these models relies on the range of variation provided in the training data. Creating a dataset that represents all variations of real-world faces is not feasible as the control over the quality of the data decreases with the size of the dataset. Repeatability of data is another challenge as it is not possible to exactly recreate 'real-world' acquisition conditions outside of the laboratory. In this work, we explore a framework to synthetically generate facial data to be used as part of a toolchain to generate very large facial datasets with a high degree of control over facial and environmental variations. Such large datasets can be used for improved, targeted training of deep neural networks. In particular, we make use of a 3D morphable face model for the rendering of multiple 2D images across a dataset of 100 synthetic identities, providing full control over image variations such as pose, illumination, and background.