HC, AI, CL | Sep 18, 2025

Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech

Taesoo Kim, Yongsik Jo, Hyunmin Song, Taehwan Kim

arXiv:2509.14627v17.21 citations | h-index: 1 | Has Code | INTERSPEECH

Originality: Incremental advance

AI Analysis

This work aims to make conversational AI agents more engaging and human-like for users. The advance is incremental: it builds on existing multimodal LLMs and shifts their focus from text output to speech generation.

The paper tackles the problem of generating natural and engaging speech in multimodal conversational agents. It proposes a model that uses the visual and audio modalities of a conversation to produce speech responses conditioned on conversation mood and style, and its experiments indicate that this yields more human-like interactions.

Human conversation involves language, speech, and visual cues, with each medium providing complementary information. For instance, speech conveys a vibe or tone not fully captured by text alone. While multimodal LLMs focus on generating text responses from diverse inputs, less attention has been paid to generating natural and engaging speech. We propose a human-like agent that generates speech responses based on conversation mood and responsive style information. To achieve this, we build a novel MultiSensory Conversation dataset focused on speech to enable agents to generate natural speech. We then propose a multimodal LLM-based model for generating text responses and voice descriptions, which are used to generate speech covering paralinguistic information. Experimental results demonstrate the effectiveness of utilizing both visual and audio modalities in conversation to generate engaging speech. The source code is available at https://github.com/kimtaesu24/MSenC
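
The two-stage flow described in the abstract (a multimodal model emits a text response plus a voice description, and a speech generator consumes both) can be sketched as below. This is only an illustrative outline, not the authors' implementation: `Turn`, `AgentReply`, `respond`, and the two toy stand-in functions are hypothetical names, and the energy threshold is an arbitrary placeholder for whatever the real model infers from audio and visual context.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Turn:
    """One conversation turn with its three modalities (hypothetical layout)."""
    text: str
    audio_features: List[float]
    visual_features: List[float]


@dataclass
class AgentReply:
    """Stage-one output: what to say and how it should sound."""
    response_text: str
    voice_description: str


def respond(
    history: Sequence[Turn],
    generate_reply: Callable[[Sequence[Turn]], AgentReply],
    synthesize: Callable[[str, str], bytes],
) -> bytes:
    """Run both stages: multimodal reply generation, then speech synthesis."""
    reply = generate_reply(history)
    return synthesize(reply.response_text, reply.voice_description)


def toy_generate_reply(history: Sequence[Turn]) -> AgentReply:
    """Stand-in for the multimodal LLM (assumption: mean audio energy sets the style)."""
    last = history[-1]
    energy = sum(last.audio_features) / max(len(last.audio_features), 1)
    style = "bright, fast-paced voice" if energy > 0.5 else "calm, soft voice"
    return AgentReply(response_text=f"I hear you: {last.text}", voice_description=style)


def toy_synthesize(text: str, description: str) -> bytes:
    """Stand-in for a description-conditioned TTS model; returns tagged bytes, not audio."""
    return f"[{description}] {text}".encode("utf-8")
```

Passing the two stages in as callables keeps the sketch independent of any particular LLM or TTS library; real components would replace the toy functions while `respond` stays unchanged.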

