AS, CL, SD | Sep 17, 2024

Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora

arXiv:2409.10969v2 | h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses code-switched speech synthesis for multilingual applications. It is an incremental advance: it builds on existing LLM speech capabilities and adds a novel data construction strategy.

The paper tackles the limited code-switched text-to-speech capability of large language models by proposing a method that uses only monolingual corpora. The method outperforms baselines in naturalness, speaker consistency, and similarity.

While Large Language Models (LLMs) have shown potential in speech generation and recognition, their applications are mainly confined to monolingual scenarios, with limited explorations in code-switched (CS) contexts. In this paper, we propose a Code-Switched Large Language Model (CS-LLM) to enhance the code-switched text-to-speech synthesis (CS TTS) capability in LLMs with only monolingual corpora. Specifically, we begin by enhancing the multilingual speech processing ability of LLMs through multilingual speech recognition and synthesis tasks. Then, we develop an effective code-switched (CS) data construction strategy that splits and concatenates words from different monolingual speech corpora to equip LLMs with improved CS TTS ability. Experiments show that our approach outperforms baselines in CS TTS in terms of naturalness, speaker consistency and similarity even with limited data. Additionally, the constructed CS data further improves multilingual speech synthesis and recognition.
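
The abstract describes the data construction strategy only as splitting and concatenating words from different monolingual speech corpora. A minimal sketch of that idea might look like the following. It assumes that each utterance is already aligned at word level and that speech is represented as discrete unit IDs; the `WordSegment` type, the `build_code_switched` helper, and the single random splice point are all hypothetical illustrations, not the authors' implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class WordSegment:
    """One word with the speech units aligned to it (hypothetical structure)."""
    word: str          # written form of the word
    units: list[int]   # assumed discrete speech-unit IDs for this word
    lang: str          # language tag of the source corpus, e.g. "en" or "zh"


def build_code_switched(utt_a, utt_b, rng):
    """Splice a prefix of utt_a onto a suffix of utt_b at word boundaries.

    Both utterances must be non-empty lists of WordSegment. The result always
    contains at least one word from each language.
    """
    cut_a = rng.randint(1, len(utt_a))      # keep the first cut_a words of A
    cut_b = rng.randint(0, len(utt_b) - 1)  # keep words cut_b.. of B
    mixed = utt_a[:cut_a] + utt_b[cut_b:]

    text = " ".join(seg.word for seg in mixed)
    units = [u for seg in mixed for u in seg.units]
    return text, units


# Toy monolingual utterances; the unit IDs are made up for illustration.
english = [WordSegment("good", [1, 2], "en"), WordSegment("morning", [3], "en")]
mandarin = [WordSegment("你好", [10], "zh"), WordSegment("世界", [11, 12], "zh")]

cs_text, cs_units = build_code_switched(english, mandarin, random.Random(0))
print(cs_text, cs_units)
```

Each call yields a paired code-switched text and unit sequence that could serve as a synthetic training example. A real pipeline would also have to handle speaker mismatch and prosody at the splice point, which this sketch ignores.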

