HUMBO: Bridging Response Generation and Facial Expression Synthesis
HUMBO addresses the need for multimodal interaction in virtual assistants, an incremental step beyond existing text- or voice-only systems.
The paper tackles the problem of creating more human-like virtual assistants by introducing HUMBO, a system that generates dialogue responses and synthesizes corresponding facial expressions from a single user-provided image, enabling coherent emotional utterances and visual expressions.
Spoken dialogue systems that help users complete complex tasks, such as booking movie tickets, have become an emerging research topic in artificial intelligence and natural language processing. With a well-designed dialogue system as an intelligent personal assistant, people can accomplish certain tasks more easily through natural language interaction. Several virtual intelligent assistants are on the market today; however, most of them focus only on textual or vocal interaction. In this paper, we present HUMBO, a system that generates dialogue responses and simultaneously synthesizes corresponding facial expressions for better multimodal interaction. HUMBO can (1) let users determine the appearance of the virtual assistant from a single image, and (2) generate coherent emotional utterances and facial expressions on the user-provided image. This is not only a new research direction but, more importantly, a step toward the ultimate goal of more human-like virtual assistants.