CV, AI, HC | Jun 12, 2024

Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

arXiv:2406.07867v2 | 40 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of building avatar chatbot systems for more natural human-computer interaction. It is an incremental step: it adapts existing language models to a new multimodal domain.

The paper builds a face-to-face spoken dialogue model that takes audio-visual speech as input and generates audio-visual responses without intermediate text. It contributes both the model and MultiDialog, a large-scale multimodal corpus of approximately 9,000 dialogues totaling 340 hours.

In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus containing 340 hours of approximately 9,000 dialogues, recorded based on the open domain dialogue dataset, TopicalChat. MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model incorporates a textually pretrained large language model and adapts it into the audio-visual spoken dialogue domain by incorporating speech-text joint pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating a face-to-face conversation. Demo and data are available at https://multidialog.github.io and https://huggingface.co/datasets/IVLLab/MultiDialog, respectively.

