Affective Faces for Goal-Driven Dyadic Communication
This work addresses the challenge of modeling multimodal communication in AI systems, with applications such as virtual agents and social robotics, though its contribution is incremental because it builds on existing models.
The paper tackles the problem of producing socially appropriate facial expressions for a listener in dyadic conversations: given the speaker's speech, a framework that combines large language models and vision-language models retrieves a video of a suitable listener. Experiments and visualizations show that the listeners it outputs are significantly more socially appropriate than those of baselines.
We introduce a video framework for modeling the association between verbal and non-verbal communication during dyadic conversation. Given the input speech of a speaker, our approach retrieves a video of a listener, who has facial expressions that would be socially appropriate given the context. Our approach further allows the listener to be conditioned on their own goals, personalities, or backgrounds. Our approach models conversations through a composition of large language models and vision-language models, creating internal representations that are interpretable and controllable. To study multimodal communication, we propose a new video dataset of unscripted conversations covering diverse topics and demographics. Experiments and visualizations show our approach is able to output listeners that are significantly more socially appropriate than baselines. However, many challenges remain, and we release our dataset publicly to spur further progress. See our website for video results, data, and code: https://realtalk.cs.columbia.edu.
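The retrieval idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the interpretable internal representation is a short text description of the desired listener expression, and it replaces both models with stand-ins. The names `describe_expression`, `retrieve_listener`, and `CLIPS`, the hard-coded rules, and the bag-of-words similarity are all hypothetical and chosen only so the example runs without external dependencies.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real text or vision encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def describe_expression(speech, goal):
    """Hypothetical stand-in for the language-model step.

    Maps the speaker's words and the listener's goal to a text description
    of an expression. Hard-coded rules, not a real language model.
    """
    sad_news = any(w in speech.lower() for w in ("lost", "died", "failed"))
    if goal == "comfort":
        return "concerned sympathetic frown" if sad_news else "warm smile"
    return "neutral attentive gaze"


def retrieve_listener(speech, goal, clips):
    """Return the id of the clip whose caption best matches the description.

    Captions stand in for what a vision-language model would score directly
    from video frames (an assumption of this sketch).
    """
    query = embed(describe_expression(speech, goal))
    return max(clips, key=lambda clip_id: cosine(query, embed(clips[clip_id])))


# Hypothetical candidate listener clips, each summarized by a caption.
CLIPS = {
    "clip_a": "listener with a warm smile nodding",
    "clip_b": "listener with a concerned sympathetic frown",
    "clip_c": "listener with a neutral attentive gaze",
}

if __name__ == "__main__":
    print(retrieve_listener("I lost my job yesterday", "comfort", CLIPS))
```

Because the intermediate description is plain text, changing the `goal` argument changes which clip is retrieved for the same speech, which is one simple way to picture the goal conditioning the abstract mentions.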