SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
SpeechX addresses the need for a single model that handles a range of speech generation and transformation tasks, including under noisy conditions, for applications in speech processing and AI.
The paper tackles the limited versatility of generative speech models on diverse audio-text tasks. It introduces SpeechX, which achieves performance comparable or superior to specialized models on tasks such as zero-shot TTS, noise suppression, and speech editing.
Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations such as high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks that involve transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way to leverage textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.
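To make the idea of task-dependent prompting more concrete, the following is a minimal sketch of how a single prompt sequence might be assembled for each task before it is passed to a codec language model. It is an illustration under stated assumptions, not the paper's implementation: the function `build_prompt`, the `<task:...>` and `<sep>` markers, and the `t`/`a` token spellings are all hypothetical, and only the task names are taken from the abstract.

```python
from typing import List, Optional, Sequence

# Task names follow the abstract; everything else in this sketch is hypothetical.
TASKS = (
    "zero_shot_tts",
    "noise_suppression",
    "target_speaker_extraction",
    "speech_removal",
    "speech_editing",
)


def build_prompt(
    task: str,
    text_tokens: Optional[Sequence[int]] = None,
    audio_tokens: Optional[Sequence[int]] = None,
) -> List[str]:
    """Assemble one flat prompt: task marker, text tokens, separator, audio tokens."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    prompt = [f"<task:{task}>"]  # hypothetical task marker
    prompt += [f"t{tok}" for tok in (text_tokens or [])]  # textual input, may be empty
    prompt.append("<sep>")  # hypothetical separator between text and audio
    prompt += [f"a{tok}" for tok in (audio_tokens or [])]  # input codec tokens
    return prompt


if __name__ == "__main__":
    # Same task with and without textual input; only the prompt contents change.
    print(build_prompt("noise_suppression", text_tokens=[1, 2], audio_tokens=[7, 8, 9]))
    print(build_prompt("noise_suppression", audio_tokens=[7, 8, 9]))
```

In this sketch every task shares one prompt layout, so supporting a further task would only require a new entry in `TASKS`. That is one way to read the "unified and extensible" claim, though the actual prompt format used by SpeechX may differ.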