CV · Apr 18, 2022

Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion

Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo Kanazawa, Shiry Ginosar

arXiv:2204.08451v130.6153 citations · h-index: 156

Originality: Incremental advance

AI Analysis

This work addresses the challenge of modeling realistic nonverbal interaction, with applications in virtual agents and human-computer interaction. It is an incremental advance with novel method components.

The paper models non-deterministic listener facial motion in dyadic conversations. Its framework autoregressively outputs multiple possible listener motions from multimodal speaker inputs: a cross-attention transformer combines the speaker's motion and audio, and a motion-encoding VQ-VAE enables non-deterministic prediction. The authors report that the method outperforms baselines qualitatively and quantitatively, and they introduce a large in-the-wild dataset of dyadic conversations.
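As a rough illustration of how cross-attention can fuse two feature streams, here is a minimal sketch in plain Python. It is not the authors' implementation. It omits learned projections, multiple heads, and the rest of a transformer block, and the choice of motion features as queries and audio features as keys and values is an assumption made only for this example. All names and feature values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    queries: list of vectors from one stream (assumed here: motion features)
    keys, values: lists of vectors from the other stream (assumed: audio)
    Returns one fused vector per query.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused

# Hypothetical toy features: 2 motion frames, 3 audio frames.
motion = [[1.0, 0.0], [0.0, 1.0]]
audio_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_values = [[0.5, 0.1, 0.0], [0.0, 0.2, 0.9], [0.3, 0.3, 0.3]]
fused = cross_attention(motion, audio_keys, audio_values)
```

Each output vector is a weighted average of the audio values, with weights set by how well the motion query matches each audio key.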

We present a framework for modeling interactional communication in dyadic conversations: given multimodal inputs of a speaker, we autoregressively output multiple possibilities of corresponding listener motion. We combine the motion and speech audio of the speaker using a motion-audio cross attention transformer. Furthermore, we enable non-deterministic prediction by learning a discrete latent representation of realistic listener motion with a novel motion-encoding VQ-VAE. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions. Moreover, it produces realistic 3D listener facial motion synchronous with the speaker (see video). We demonstrate that our method outperforms baselines qualitatively and quantitatively via a rich suite of experiments. To facilitate this line of research, we introduce a novel and large in-the-wild dataset of dyadic conversations. Code, data, and videos available at https://evonneng.github.io/learning2listen/.
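The discrete latent representation mentioned in the abstract can be illustrated with a toy vector-quantization step, the core lookup inside a VQ-VAE. The sketch below is an assumption-laden illustration, not the paper's model. The codebook, the frame vectors, and the probability vector are hypothetical values, and a real system would learn the codebook and predict the probabilities with a network.

```python
import random

def quantize(frames, codebook):
    """Map each continuous vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    indices = [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]
    quantized = [codebook[i] for i in indices]
    return indices, quantized

def sample_index(probs, rng):
    """Draw one codebook index from a categorical distribution.

    Sampling (rather than taking the argmax) is what lets one input
    yield several different plausible outputs.
    """
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical 3-entry codebook of 2-D motion codes.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
frames = [[0.9, 1.2], [-0.8, 0.4], [0.1, -0.1]]
indices, quantized = quantize(frames, codebook)
print(indices)  # → [1, 2, 0]

# Hypothetical predicted distribution over the next code.
rng = random.Random(0)
next_code = sample_index([0.2, 0.5, 0.3], rng)
```

Because the predictor outputs a distribution over a finite set of codes, repeated sampling gives multiple distinct listener motions for the same speaker input, which is the sense of "non-deterministic" used here.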
