CL SD AS Sep 20, 2023

Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model

arXiv:2309.11000v1 20.7126 citations h-index: 5 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of creating more human-like spoken dialogue systems for applications in human-computer interaction, though it appears incremental as it builds on existing LLM capabilities.

This paper tackles the problem of building AI spoken dialogue systems by proposing a joint modeling approach for dialogue response and speech synthesis based on Large Language Models (LLMs). Its experiments show that LLMs can predict prosodic structure and can integrate dialogue responses with linguistic features in a unified encoding format, which points to a promising direction for unified systems.

This paper explores the potential of constructing an AI spoken dialogue system that "thinks how to respond" and "thinks how to speak" simultaneously, which more closely aligns with the human speech production process compared to the current cascade pipeline of independent chatbot and Text-to-Speech (TTS) modules. We hypothesize that Large Language Models (LLMs) with billions of parameters possess significant speech understanding capabilities and can jointly model dialogue responses and linguistic features. We conduct two sets of experiments: 1) Prosodic structure prediction, a typical front-end task in TTS, demonstrating the speech understanding ability of LLMs, and 2) Further integrating dialogue response and a wide array of linguistic features using a unified encoding format. Our results indicate that the LLM-based approach is a promising direction for building unified spoken dialogue systems.
