CL | AI | SD | AS | Oct 27, 2024

Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation

arXiv:2410.20336v1 | 6 citations | h-index: 6 | ICASSP
Originality: Incremental advance
AI Analysis

This work addresses the under-explored challenge of enabling large language models to generate speech, which could benefit applications in multimodal dialog systems and speech synthesis.

The authors tackled the problem of extending text-dominant large language models to speech generation by introducing TTS-Llama, a fine-tuned Llama model for text-to-speech that achieves state-of-the-art performance, and MoLE-Llama, a multimodal model that maintains competitive performance in both text and speech tasks while mitigating catastrophic forgetting.

Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, the extension of text-dominant LLMs to speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-experts architecture. Extensive empirical results demonstrate MoLE-Llama's competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating the catastrophic forgetting issue in either modality. Finally, we explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating its potential as a multimodal dialog system capable of speech generation.
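
The abstract names two ingredients, parameter-efficient fine-tuning on top of a pretrained model and a mixture-of-experts architecture, without giving details. The sketch below shows one generic way such a layer can be put together. It is not the authors' implementation. It assumes, hypothetically, that the adapters are low-rank (LoRA-style) updates and that a softmax router mixes them. The class names, dimensions, and routing rule are all illustrative.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix used for illustrative initialisation."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LowRankExpert:
    """One adapter (assumed LoRA-style): delta(x) = B (A x), with B starting at zero."""

    def __init__(self, d_in, d_out, rank, rng):
        self.A = rand_matrix(rank, d_in, rng)
        self.B = [[0.0] * rank for _ in range(d_out)]

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


class MixtureOfAdaptersLayer:
    """Frozen base projection plus a softly routed sum of low-rank experts."""

    def __init__(self, d_in, d_out, rank=2, n_experts=2, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)  # pretrained weight, kept frozen
        self.experts = [LowRankExpert(d_in, d_out, rank, rng) for _ in range(n_experts)]
        self.router = rand_matrix(n_experts, d_in, rng)  # hypothetical trainable gate

    def gate(self, x):
        """Per-expert mixing weights for input x (they sum to 1)."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        y = matvec(self.W, x)
        for weight, expert in zip(self.gate(x), self.experts):
            delta = expert(x)
            y = [yi + weight * di for yi, di in zip(y, delta)]
        return y


layer = MixtureOfAdaptersLayer(d_in=4, d_out=3)
x = [1.0, -0.5, 0.25, 2.0]
base_only = matvec(layer.W, x)
mixed = layer.forward(x)
```

Because every expert's `B` matrix starts at zero, `mixed` equals `base_only` before any training, so the frozen model's behaviour is the starting point. Only the adapters and the router would be updated during fine-tuning, which is the general reason this family of designs is used to limit catastrophic forgetting.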
