Towards Multimodal Social Conversations with Robots: Using Vision-Language Models
This work outlines the foundational needs of multimodal social robots, but it is incremental: it builds on existing vision-language models and presents no new experimental results.
The paper addresses the lack of multimodal capabilities in social robots for open-domain conversations, proposing that vision-language models can process visual information to enable more natural social interactions.
Large language models have given social robots the ability to autonomously engage in open-domain conversations. However, they are still missing a fundamental social skill: making use of the multiple modalities that carry social interactions. While previous work has focused either on task-oriented interactions that require referencing the environment or on specific phenomena in social interactions, such as dialogue breakdowns, we outline the overall needs of a multimodal system for social conversations with robots. We then argue that vision-language models are able to process this wide range of visual information in a sufficiently general manner for autonomous social robots. We describe how to adapt them to this setting, identify the technical challenges that remain, and briefly discuss evaluation practices.