CV · Apr 18, 2022

Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion

arXiv:2204.08451v1 · 148 citations · h-index: 156
Originality: Incremental advance
AI Analysis

This work addresses the challenge of realistic nonverbal interaction modeling for applications in virtual agents and human-computer interaction, representing an incremental advance with novel method components.

The paper models non-deterministic listener facial motion in dyadic conversations. Given multimodal speaker inputs, the framework autoregressively outputs multiple possible listener motions. It fuses the speaker's motion and audio with a cross-attention transformer, and it uses a motion-encoding VQ-VAE to learn a discrete latent representation that makes non-deterministic prediction possible. The authors report that the method outperforms baselines both qualitatively and quantitatively, and they introduce a large in-the-wild dataset of dyadic conversations.

We present a framework for modeling interactional communication in dyadic conversations: given multimodal inputs of a speaker, we autoregressively output multiple possibilities of corresponding listener motion. We combine the motion and speech audio of the speaker using a motion-audio cross attention transformer. Furthermore, we enable non-deterministic prediction by learning a discrete latent representation of realistic listener motion with a novel motion-encoding VQ-VAE. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions. Moreover, it produces realistic 3D listener facial motion synchronous with the speaker (see video). We demonstrate that our method outperforms baselines qualitatively and quantitatively via a rich suite of experiments. To facilitate this line of research, we introduce a novel and large in-the-wild dataset of dyadic conversations. Code, data, and videos available at https://evonneng.github.io/learning2listen/.
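The two ideas the abstract leans on, a discrete codebook and sampling several outputs from a predicted distribution, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the tiny 2-D codebook, and the uniform `toy_logits` predictor are all hypothetical stand-ins, and the paper's actual predictor is a transformer conditioned on the speaker's motion and audio.

```python
import math
import random


def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry."""
    indices = []
    for z in latents:
        # Squared Euclidean distance from this latent to every code.
        dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def sample_code(logits, rng):
    """Draw one code index from the softmax distribution over logits."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def sample_sequences(next_logits, length, n_samples, seed=0):
    """Autoregressively sample several code sequences from one predictor.

    next_logits is a hypothetical callable mapping the codes generated so far
    to logits over the codebook; a real model would also condition on the
    speaker's motion and audio.
    """
    rng = random.Random(seed)
    sequences = []
    for _ in range(n_samples):
        seq = []
        for _ in range(length):
            seq.append(sample_code(next_logits(seq), rng))
        sequences.append(seq)
    return sequences


# Toy data (assumed for illustration only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.9, 1.2], [-0.1, 0.1], [-0.8, 0.7]]
indices = quantize(latents, codebook)


def toy_logits(history):
    # Uniform distribution over the three codes, regardless of history.
    return [0.0, 0.0, 0.0]


futures = sample_sequences(toy_logits, length=4, n_samples=3)
```

Because each step samples from a distribution over discrete codes rather than regressing a single value, repeated runs with different seeds give different but valid code sequences, which is the sense in which such a model can output multiple possibilities for the same input.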
