CL · SD · AS · Sep 20, 2023

Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model

arXiv:2309.11000v1126 citations · h-index: 5 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of building more human-like spoken dialogue systems for human-computer interaction, though it appears incremental because it builds on existing LLM capabilities.

This paper proposes jointly modeling dialogue response and speech synthesis with Large Language Models (LLMs) as a step toward unified AI spoken dialogue systems. Its experiments show that LLMs can predict prosodic structure and can integrate linguistic features with dialogue responses, which the authors take as evidence that unified systems are a promising direction.

This paper explores the potential of constructing an AI spoken dialogue system that "thinks how to respond" and "thinks how to speak" simultaneously, which more closely aligns with the human speech production process compared to the current cascade pipeline of independent chatbot and Text-to-Speech (TTS) modules. We hypothesize that Large Language Models (LLMs) with billions of parameters possess significant speech understanding capabilities and can jointly model dialogue responses and linguistic features. We conduct two sets of experiments: 1) Prosodic structure prediction, a typical front-end task in TTS, demonstrating the speech understanding ability of LLMs, and 2) Further integrating dialogue response and a wide array of linguistic features using a unified encoding format. Our results indicate that the LLM-based approach is a promising direction for building unified spoken dialogue systems.
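
To make the idea of a "unified encoding format" concrete, the following is a minimal sketch of how a response and its prosodic boundaries could share one text sequence. The abstract does not specify the actual format, so everything here is an assumption for illustration: the `#1`/`#2`/`#3` boundary tags, the whitespace tokenization, and the helper names `encode` and `decode` are hypothetical and are not taken from the paper or its code.

```python
import re

# Hypothetical boundary tags (assumption, not the paper's format):
# 0 = no boundary, 1..3 = increasingly strong prosodic boundaries.
BOUNDARY_TAGS = {0: None, 1: "#1", 2: "#2", 3: "#3"}


def encode(words, boundaries):
    """Interleave each word with the tag of the boundary that follows it."""
    if len(words) != len(boundaries):
        raise ValueError("words and boundaries must have the same length")
    tokens = []
    for word, level in zip(words, boundaries):
        tokens.append(word)
        tag = BOUNDARY_TAGS[level]
        if tag is not None:
            tokens.append(tag)
    return " ".join(tokens)


def decode(text):
    """Recover (words, boundaries) from a tagged string produced by encode."""
    words, boundaries = [], []
    for token in text.split():
        match = re.fullmatch(r"#([1-3])", token)
        if match:
            if not words:
                raise ValueError("boundary tag appears before any word")
            # A tag applies to the word immediately before it.
            boundaries[-1] = int(match.group(1))
        else:
            words.append(token)
            boundaries.append(0)
    return words, boundaries


tagged = encode(["how", "are", "you", "today"], [0, 1, 0, 3])
```

Under these assumptions, a model that emits the tagged string produces the response text and a prosodic annotation in a single pass, and `decode` splits them apart again for a downstream synthesizer.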

Code Implementations: 1 repo
